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Microarchitecture Description 

1 Problem Statement 

As the level of parallelism in the instruction stream increases so does the number of 
access ports required to a centralized register file. They are required to provide operands 
to and write back results from all the active functional units. The complexity of the register 
file grows at approximately N 3 where N is the number of access ports. The register file 
soon becomes the bottleneck in the design and starts to have a strongly detrimental affect 
on the maximum clock speed. 

This scalability issue is further hampered by the need to provide an extensive network of 
feed-forward buses between the various access ports. In order to achieve high clock 
speeds, register read and write operations are typically designed as being in different 
stages within the pipeline. However, in order to achieve high code performance it is a 
requirement that an instruction can be pass it results onto an immediately following 
instruction. Such an instruction is executed just one dock cycle later (presuming the 
instruction only takes one clock cycle to perform). This requires that the register file can 
detect reads and writes being performed on the same clock cycle to the same register 
and provide special forwarding buses to directly transfer the data to the reading unit 
without having to write to the register file first. Given that the number of access ports and 
the feet that every write port has to be compared against every read port, this creates a 
very challenging circuit design. Moreover, it is running within the critical path of the 
processor pipeline and has a direct impact on maximum clock frequency for whole 
processor. 

2 Prior Art 

Some VLIW solutions have adopted a clustered approach to help alleviate this issue. In 
this model the functional units are partitioned into clusters, each having a private register 
file. Communication between clusters requires one additional clock cycle of latency. Thus 
performance suffers if there is significant communication between clusters. Code 
generation for such machines seeks to minimise the number of data transfers between 
clusters. 

Another approach is that undertaken within the field of Transport Triggered Architectures 
(TTA). Code for TTAs controls transports rather than operations. That is, the instruction 
set specifies how data items are moved around the machine to different functional units. It 
is transport rather than operation centric in nature. By explicitly managing the transport of 
data between functional units and the register file, a TTA is able to reduce the total 
number of access ports required to the register file. Moreover, a TTA explicitly schedules 
the transport of data over feed-forward buses and thus avoids the need for complex 
register number comparison logic. 

3 Summary of Contribution 

CriticalBlue employs a two-tier register file approach. There is a main register file but it 
has a very limited number of access ports. The code generator seeks to minimise the 
number of register file accesses by passing data values directly between functional units 
and intermediate holding registers without passing them through the register file. 
Moreover, reads and writes to the register file are explicitly generated by the code 
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generator just like any other operation. The register file is treated like any other functional 
unit in the processor and has no special status. This is in common with the TTA approach. 

Each functional unit has output registers for holding its results. Operands for functional 
units are obtained via multiplexers that select results from a number of different result 
registers. This effectively forms a crossbar switch between the results from functional 
units and the inputs to them. The CriticalBlue machine code represents the selection 
settings for these multiplexers on each clock cycle. Thus rather than specifying a register 
number, where an operand is to be read or written, it specifies the bus on which a 
particular data item is available. The code generator is aware of the structure of the buses 
in the processor and controls them alongside the functional units themselves. The 
majority of data values are passed from functional unit to functional unit without every 
passing through the register file. 

If every functional unit could read from any result register then the problems of the 
centralized register file would return, due to the level of connectivity to the multiplexers. 
One of the key aspects of the architectural synthesis is that the data paths are optimized 
so that the crossbar architecture is only sparsely populated. 

Only a basic connectivity network is provided by default. This allows any result to be 
passed to any operand. However, this may require the value to pass through a number of 
intermediate holding registers and thus take several clock cycles for the transport to 
complete. 

The default connectivity is optimised in order to minimise the average number of clocks 
required to transport a data item from one functional unit to another. This is done 
analyzing the data flow in a particular application, especially those parts of the application 
that are most critical to its overall performance. If the program requires frequent 
communication between particular functional units then this is reflected in the connectivity 
network. A direct connection is created between the functional units or the transport 
distance, in terms of the number of intermediate holding registers, is reduced. Thus the 
bus connectivity within the processor becomes optimised for a particular end application. 



4 Architectural Overview 

4.1 General Philosophy 

One of Ihe key requirements of the architecture is to support scalable parallelism. The 
basic structure of the micro architecture and the operation of the design tools are all 
focused on that goal. 

Extracting parallelism from highly numeric loop kernels is relatively straightforward. Such 
loops have regular computation and access patterns that are easy to analyse. The nature 
of the algorithms also tends to lend itself well to parallel computation. The architecture just 
needs to balance the availability of computational resources (such as adders, multipliers) 
and memory units to ensure the right degree of parallelism can be extracted. Such 
numeric kernels are common for DSPs. The loops tend to lack any complex control flow. 
Thus DSPs tend to be highly efficient at regular computation loops but are very poor at 
handling code with more complicated control flow. 

Other than in numeric computation loops, C and C++ code tends to be filled with 
complicated control flow structures. This is simply because most control code is filled with 
conditional statements and short loops. Most C++ code is also filled with references to 
main memory via pointers. The result is a code stream from which it is extremely difficult 
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to extract useful amounts of parallelism. In average RISC code, approximatelv 30% of all 
.nstruct.ons are memory references and a branch is encountered eSJSlSrSfc^ 

General purpose processors for PCs have to deal with this kind of code and extract 
parallels from it in order to achieve competitive performance. The cSmplexrtv ofPC 
processors has mushroomed in recent years to try and deal with tiiis issuT S 5 JS 
logic for a modem PC processor has literally millions of tran^stoTdScat^e tosk 0 
extracting parallelism from the code being executed. The extra hardware needTd I to 
actually perform the operations in parallel is tiny in comparison with toetoaic reouiiS o 
find and control them. The main method utilised is to support dynamic 5E£S£?i5 

Z£ " from ^^"-Jh 3 f°T ^ processor t0 execute instmSonsT? d^renl 
order from that specified by the program. It can also execute those inst^chons 
speculatively, before it knows for sure whether they should be executed I a° all Si 
para le sm to be extracted across branches. The difficult constraint I fe toat the ^execSn 
must always produce exactly the same answer as would result if the hstructSrwIr2 
executed strictly one by one in the original order. instructions were 

for e aSS.B?iTS X ity ™ erheads of <%™™ c out-of-orxler execution are far too high 
hv, fhfSSf Processor. There is a significant cost overhead due to the area occupied 

ampn fl hTtr t hL^ ,C ', n K 1° menti ° n me 0081 0f desi 9 nin 9 *• Additionally, such fogte fe not 
amenable to the scalability requirements of the CriticalBlue architecture 

ClT^I^^^'T 16 " 18 in the area of m, ' cro architecture have been focused on 
^Znnf^ r r&S - ,^ ere iS a ^ aGk to basics " movement that seeks to placTthe 
burden of extracting parallelism on the compiler. The compiler is able to perform much 

Snn a th a,yS,S t0 T*i Paral,e,iSm in a PP ,ication - 11 is <*» consUrab* 2o?*t 
SXI " fh U,Va,ent ^ '° giC - ™ S is beCause the °° n ™ logic ZS find I the 
paralleHsm as the program is mnning so must itself be highly pipelined and suffers from 

SSSSTSSTT 8 ° f Ci f Uit d6Sign - 1716 COmpHer ^ormsaW of its work^ from h 
software with the luxury of much longer analysis time. For most classes of s£t£ 

parallelism, compiler analysis is very effective. tatlC 

H^rminfw Sf' SOf ^T f naly8is is poor at parallelism that can only be 

determined dynamically. Examples of these are branches and potentially aliased memor* 

S^b^^^ ^ *? Probabi,ity that a P^icu.ar P branch K'aken from 
1 °. ' rt . Cann ° t kn ° W for SUre whether rt *■ be tak ^ on any particular 
instance A compiler can also tell from profiling that two memory accesses never seem to 
access the same memory location, but it cannot prove that will alw?ys be toe oaU 
Consequently it is not able to move a store operation over a potSv aliased^ 
operation as that might affect the results the program would' geLrate This reSril ^ 

dTnScaV ara,,e,ism that ™ be extracteJ rstatica,,y in !5S?rr2a2 

CriticalBlue employs a unique combination of static and dynamic parallelism extraction 
LmZ e h "J architec I ur f access to "igh degrees of parallelism wLTthe ovlSd Z 
Sl^T? 1 * 0 ' structures. The software tools perform all the analysis moving 
^«^ P Hl eX,ty bU i den 3W f y *° m 016 hardwara ^ CriticalBlue architecture itse If fe S 
22? ^ I ?■ m0deL 11 6XeCUteS instructi °n* ^ exactly the order specified by toe 
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nardware s able to detect if the assumptions are wrong and arrange a re-execution of the 

S^MT^^i^ thS 0011861 answer ' n a " circumstanS Sar^e 
overhead for this ''hazard" detection and re-execution is very small. naraware 

JrchitectureT b6l0W illUStrateS thS P0W6r that thiS Sty,e 0f execut, * on 9 ives *» CriticalBlue 



Segment of Code 

Basic Block 
of Code 




Actual Path 
of Execution 



Conventional CPU 



© 
o 
© 



Strand Based Execution 

&;::£) O 
GC, O O 



Original code 
contains many 
possible paths 



Conventional CPU 
executes blocks 
sequentially 



Strand parallelism allows 
some blocks to be performed 
concurrently, reducing 
execution time. 

An example segment of code is shown on the left hand side. This is comoosed of 
f ,C b °? kS - A . baSiC b ' 0Ck iS 3 S69ment of °°* is dented b y a bSnch 

AUhe en ^?S H S^?E t ^ SiC b ' 0Ck ? 6n a " inStnJCtions ™' thin « ™» ^executed 
At tne end of the basic block there is a branch instruction that causes execution tn 

"5 ° r 2°. different P 08815 ' 6 successor block *- SSSn^S? which 
the branch s performed is generally calculated in the code of the basic block so ft Tnot 
possfole to know before entering the block which route will be ikaT&S^aS^^ 
toe code will produce a particular path of execution through the baste bbSTStah 
paths may be considerably more likely than others but any route may be taken 

The middle section of the diagram shows the execution in a conventional processor that 

^^LSC^ f d *, ° U k ° f ° rd8r eXeCUti ° n - ™ 8 18 wSSTrSS ^proSsso 
resdved 33 t0 b6 6XeCUted ° ne after the other - as »» branches are 

The right hand section shows the execution model for CriticalBlue. It is able to execute 
code from a number of different basic blocks in parallel. It does tfiis tc » increSS tt2 
amount of parallelism and efficiency of the architecture. It might knowftat one 
rs very likely to follow another. It can pull instructions forward second block ^to 

eSlndSI ^ ^ in ? trU ? i0nS ° f firSt ™ S a,,ows SuSonfto Se^frted 
earlier (and thus finish earlier) and to more effectively balance the resource utilisation of 

SSSSKT; I" CriticalBlue 8Chedu,ing a,gorithm does «! 

probability that the execution of a particular instruction will be useful. 

Executing code speculatively in this manner normally requires a significant hardware 
overhead. If a particular block should not have been^ecLiteftSen ^yl ZSTSZ 
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produced must be discarded. This is referred to as "squashing" the execution. In 
particular, any store operations that the code has performed must be undone as they 
could permanently pollute the memory space with incorrect results. CriticalBlue employs 
mechanisms that allow the benefits of speculative execution while only requiring the 
minimum of hardware overhead. 

4.2 Functional Units 

The internal architecture of functional unit is shown below: 




The central core of a functional unit is the execution unit itself. It performs the particular 
operation for the unit. New functional units may be created using user defined execution 
units. The CriticalBlue tools automatically instantiate the required "glue" blocks around the 
execution unit in order to form a functional unit. These glue blocks allows the functional 
unit to connect to other units and to allow the unit to be controlled from the instruction 
word. 



Functional units are placed within a virtual array arrangement. The dashed lines shown on 
the diagram illustrate this. Individual functional units can only communicate with near 
neighbours within this array. This spatial layout prevents the architectural synthesis 
generating excessively long interconnects between units that would significantly impact 
clock speed. 

The control inputs include the segment of the instruction word that controls that particular 
functional unit. The method selector is passed directly to it. Other fields are used to control 
the multiplexers that select data from buses. Each operand input to the execution unit 
may be chosen from one of a number of potential data buses using a multiplexer. In some 
circumstances the operand may be fixed to a certain bus, removing the requirement for a 
multiplexer. The number of selectable sources and the choice of particular source buses 
are under the control of the CriticalBlue architectural optimisation tools. 

All results from an execution unit are held in independent output registers. These drive 
data on point-to-point buses connected to other functional units. The point-to-point nature 
of these buses minimises power dissipation and propagation delays. Data is passed from 
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one functional unit to another in this manner. The output register holds the same data until 
a new operation is performed on the functional unit that explicitly overwrites the register. 



4J3 Communication Architecture 

Although the CriticalBlue architecture does have a central register file it is treated like any 
other implicit functional unit. All accesses to the register file have to be explicitly scheduled 
as separate operations. Since the register file acts like any other functional unit its 
bandwidth is limited. The code is constructed so that the majority of data values are 
communicated directly between functional units without being written to the register file. 

Traditional architectures have a centralised register file that has customized access ports 
to all of the functional units. Access to the register file is implicit in the instruction layout 
and semantics of the instruction set. The register file is used to feed the operands of the 
execution units and hold the results generated by them. Unfortunately such a centralised 
register file imposes a significant restriction on scalability. As the level of parallelism in the 
instruction stream increases so does the number of access ports required to a centralised 
register file. These are needed to provide operands to and write back results from all the 
active execution units. The complexity of the register file grows at approximately N 3 where 
N is the number of access ports. The register file soon becomes the bottleneck in the 
design and starts to have a strongly detrimental affect on the maximum clock speed. 

Given the requirement to make CriticalBlue highly scalable, communication of all data 
through a centralised register file is not a viable architectural option. Whenever a 
functional unit generates a result it is held in an output register until explicitly overwritten 
by a subsequent operation issued to the unit. During this time the functional unit to which 
the result is connected may read it. 

A single functional unit may have multiple output registers. Each of these is connected to 
different functional unit or functional unit operand. Since all connection buses are single 
source/sink there is a one-to-one correspondence between output registers and 
connections. The output registers that are overwritten by a new result from a functional 
unit are programmed as part of the instruction word. This allows the functional unit to be 
utilised even if the value from a particular output register has. yet to be used. It would be 
highly inefficient to leave an entire functional unit idle just to preserve the result latched on 
its output. In effect each functional unit has a small, dedicated, output register file 
associated with it to preserve its results. 

An example functional unit array is shown in the diagram below: 
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Given the connectivity limitations of the functional unit array, not every unit is connected to 
every other. Thus in some circumstances a data item may be generated by Se unhand 
needs to be transported to another unit with which there is no direct connection f^e 
placement of the units and the connections between them is soecHtean? ™dS n^H tn 
minimise fte number of occasions on which this o4 Ure Z> ISSZSSol iX£ 
opt.rn.sed for the data flow that is characteristic of the required applic^n^e 

To allow the-transport of such data items, any functional unit may act as a repeater That 

e , may te . several routes to ■» same destina «°n -TS«*SSSS; 

selecte the most appropriate route depending upon other operations being performed ?n 

There are underlying rules that govern how functional units can be connected together 

H^l?™?^ 8 ^ pri T arily driven by *» P^ominate data flows betS .iff unfe 

SJ^T^^J^ a " and h * e fUnctional unit array a^ y 

reacnaoie. That is, any result can reach any operand via a path throuah the arrav .^inn 
unite as repeaters. These rules ensure that any code sequence toTrtES 

S? 2 frST? 4 ^ ° f °°* aerated J2^d£2S2 

how well the data flows match the general characteristics of the application Code that 
represents a poor match will require much more use of repeating ^ghm^ 

CriticalBlue uses a truly distributed register file concept. The main reqister file does not 
form a central bottleneck that would severely hamper the scalabL of anohiSure 
The majority of communication in the CriticalBlue architecture occurc dYn^fbetweSi 

registers local to each functional, unit hold data values until thev are reauired bv 
subsequent operations. The connections between the functional unite an MtaJXSJrt 
are optim.sed synergistic^ to adapt to the data flow present in the appTcation 
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4A Instruction Representation 

CriticalBlue uses a Very Large Instruction Word (VLIW) format. This enables many 
parallel operations to be initiated on a single clock cycle, enabling massive parallelism. 
The actual width is not fixed by the architecture and is under user control. Shorter widths 
tend to be more efficient in terms of code density but poorer in extracting parallelism from 
the application. 

The instruction format is not fixed either and is dependent upon the execution units the 
user defines for a particular processor. Unlike many contemporary VLIW architectures, 
CriticalBlue uses a flat decode structure. This means that a particular execution unit is 
always controlled from a specific group of bits in the instruction word. This makes the 
instruction decoding for the architecture very straightforward. Other VLIWs tend to just 
bundle a number of independent operations into a single instruction word. They still 
require quite complex decode logic to direct different operations to the appropriate 
execution units. 

The diagram below illustrates the basic instruction decode and control paths of a 
CriticalBlue processor: 



Instruction Memory 



r 



101011111011101010 
10100 10101010 10001 
011011001010101001 
111010100101010011 
101011100111011011 



Long Instruction Word for User Defined ISA 



*v 101011 
J Unit 1 Control 



101011 
Unit 2 Control 




101011 
Unit N Control 



Mux 



vvv 

UnitN 



User defined execution units .(automatically replicated and 
connected by tools) 



The instruction (or code) memory holds the representation of the operations in the 
customized format for the processor. A new instruction word is fetched on each clock 
cycle. Each block of bits in the instruction word is used for controlling a particular 
execution unit. 



The bits in the instruction word are used to control multiplexers that direct data from the 
interconnection network to the operand inputs of the execution unit. A further field selects 
the particular method that a unit should execute. Results from the execution units are 
routed back to the interconnection network to be used by subsequent operations. 

A branch control unit allows the architecture to execute new blocks of code by loading a 
new value in the PC (Program Counter). If a branch is not executed then the PC is just 
incremented on each cycle to execute code sequentially from the code memory. 
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The diagram represents a slight simplification of how the architecture actually operates 
but demonstrates the key features. In particular, the instruction word layout is not 
completely flat. If it were then the width of the instruction word would grow with the 
number of execution units in the system, potentially reaching unwieldy widths. The 
representation would also be highly inefficient as a number of execution units will 
generally be unused on each cycle, and thus the bits controlling them would be wasted. 

The architecture actually allows a number of execution units to be controlled from the 
same block of bits in the instruction word. An operation code field selects which of the 
execution units is selected. The design tools analyse the usage coincidence of different 
execution units. Units that are rarely used on the same cycle are allocated to the same 
group of bits in the instruction word, improving code density with minimal impact on 
performance. 



4.5 Strand Execution Model 

One of the central innovations of the CriticalBlue architecture is its "strand" based 
execution mechanism. These are rather like threads but represent a much lower level 
construct that is present in the architecture to support out-of-order execution. 

A strand represents a particular sequential group of operations that is being executed on 
the machine. Many strands may be executed simultaneously. Each individual operation 
that is performed belongs to a particular strand. Whenever an instruction word is executed 
it may contain operations that associated with a number of different strands. 

The diagram below illustrates the relationship between strands and basic blocks: 



Selected Region 



j==j Strand i 



Strand 2 



Strand 3 □ 




Sirand 4 



□strand 5 
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The original basic blocks Impose a 
order on the code. Operations In an 
strand must not affect an earlier 
one. 
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To allow parallelism the 
schedule mixes up instructions 
from different strands. Strand 
numbers are shown In the 
boxes. 



Causal Order 

Straid 1 
Strind 2 
Strand 3 
Str*jid4 
Strand 5 



But the correct order 
of memory accesses 
must be maintained. 



The different colours on the left hand side of the diagram represent different basic blocks 
in the code. The individual blocks represent the individual operations that are present in 
each of the basic blocks. The last operation is a branch that determines which basic block 
will be executed next. Each of the basic blocks is allocated a different strand number. 
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VSJSSH f Cti ° n ° f ^ dia ^ ram Sh0WS a P 0 ***' 8 ' instruction schedule generated for a 
CntcalBlue processor. Operations from different strands (i.e. basic blocks) may be feSed 

oZ 9 2 e n same , dock c £ ,e - ™ e ori 9 inal ^nd number is shown in each o^eralon ^e 
nnplioL °Ef r th ', n a , PartiCU,ar Strand is alwa y s maintained. The order of 
c?hS ? SfT" f 8 ™ 18 d0SS not have to be maintained, allowing much greater 
S252„^ 8 ? m H fer K 1 5 e arcnitecture - ^ough the scheduler is SbteTSSS 
? ° b t^ een StrandS * Wil1 on, y d0 80 tf »* is ^likely to lead to a 
to^o nolo dWare 19 t0 reCOV6r fr ° m 3 hazard but »»» is a Performance penalty 



This mechan.sm allows instructions to be issued out of order. However if the correct 
resurts are to be produced by the architecture then the data flows Seen sLnSS 
would occur/ they were executed in the correct order must be maWaSTSie^hlS 
part of the diagram shows that the logical order of the strands is a^ foe same A result 

Pn st?nf 2 ThTt XffS f ^ \ Sh ? M never influence caCa«oTpe5oS 
in strand 2-That could never happen if the instructions were executed in order In effect 
the architecture has two different time domains. There is the physica! time domair of foe 
order of .nstmction execution. There is also the logical time SomatT thS was ^resfnt in 
sema°S *"* be in ° ?der to maintai " *■ oSS- pr^ram 

The CriticalBlue tools can determine the correct ordering of most operations statically 

d^SS P °ifolL Ws is memory operations - ^ - -JESS'S 

4*6 Region Based Execution 

In the CrificalBlue architecture all execution is performed within blocks of code called 
^T^^^X^ * "! - action JUSTUS 

SMS an^ t^rme^ 

blocks bi o reg.ons. Regions are often used as the basic arena in wh gSSfSawfcSta 
optinnsatons are performed. Global scheduling refers to the mcSS^SSfi 
across branches as well as within individual basic blocks. GfobTscheduHm Tt 
considerably more difficult problem than basic block scheduling. scheduling is a 

In the CriticalBlue architecture, regions are always executed folly If the reoion contain* * 
number of internal branches to basic blocks outside of th^gtoTlS foe^aX not 
resolved until the end of the region reached. The compiler constructs the Sns from 

?Etah 3J5 K in *r most ^ ke,y execution paths the S b.ock" 

successor regions P m ^ branCh t0 Se,ect ° ne of a number of differen t 

All strands are limited to the lifetime of a single region. The architecture is able to execute 
£2E'L 0Ut of t 0rder ^ hin a P articu,ar Wn. Out of order execution and I any S 
hazards are resolved at the end of the region and then execution continues on to another 
region, which may itself issue operations out of order. «™nues on to another 

I h o e m di .f 9 !f m b f° W il,ustrates an example set of regions and the relationships between 
them. It shows the execution of the individual strands within each region: 
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If a hazard is detected during execution then the sequential semantics of the strands have 

SLSliE? 1 Pf erved 2» architecture must be able to recover from this situation 
with as little overhead as possible. 

Upon detecting a hazard in a particular strand the results generated for that and any later 
(i.e higher numbered) strands may be incorrect. The architecture allows execution to 
continue until the end of the region, when the strands will be completed. Any Sute from 

h n e ,S I* nd ♦ a ;^ h !? ler • Strands are discard ^. The architecture then re-executes the 
code from the start of the region again. Since lower numbered strands have already been 

too^X^T l eV T n0t 6XeCUted 3 S6C0nd time - ^ archrtecture incS 
logic to block operations from those strands. Since the lower strands have completed and 
generated their results the hazard strand is ableto execute correctly, utilizing any required 

haS r *5f ,0W6 r Strands - ' f another ' even ^her numbered, strand generates a 
hazard then the regron may be repeated a second time. When all strands have 
successfully completed the processor may move onto the successor region. 

Of course the goal of the CriticalBlue architecture is to execute all strands successfully on 
the first attempt. The compiler does extensive analysis to ensure that the chances of 
hazards are smalh The key is that the compiler doesn't have to prove that a hazard 
J appe . n - re - exe cution mechanism will ensure correct completion of the 
strands if required. It does this with a minimum of hardware overhead. The size of regions 
is limited to a few tens of instructions so that the overhead of any re-execution of the 
region is not too great. 
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5 Microarchitecture 

5.1 Overview 

SSufS? inT"'?* ** under| y |n 9 microarchitecture of a CriticalBlue processor It 
shows how instructions are fetched, decoded and directed tovuarrte «L « • J 

sisals 

52 Detailed Architecture 

The diagram below details the basic architecture of all CriticalBlue processors Th* 
architecture ,s split between the control units on the .eft andt?' 25kSS?S^„X 
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The fixed control units are present in all CriticalBlue processors. They provide the basic 
functionality of" the architecture that allow it to fetch and execute instructions, handle 
branches and interrupts and implement the strand execution mechanism. These units are 
embedded with CriticalBlue's intellectual property. The quantity of control logic in a 
CriticalBlue processor is very small in comparison to other embedded RISC processors. 
Moreover, the control logic is fixed and does not expand as the number of execution units 
in the processor is scaled. Other architectures suffer from exponential growth in control 
circuitry as the parallelism in the processor is increased. 

The majority of functional units in the architecture are user defined. One of the 
fundamental innovations of the CriticalBlue processor is its complete configurability, it 
provides a framework that allows users to define their own functional units. There are a 
number of implicit Junctional units that must be present. These implement instructions 
from the host instruction set or implement basic constructs such as main memory. All 
functional units are controlled from fields in the execution word. 

The functional units are organised into a virtual array structure. Neighbours or other 
spatially close functional units are able to communicate data using a network of 
customised connections. These pass data from one functional unit to another without the 
need for a centralised switch network. All data transfers are under direct control of the 
execution word. 



5J3 Execution Word Representation 

CriticalBlue is a wide instruction word machine. The actual width of the word is a user 
configurable parameter. The processor optimisation can be made to select a suitable 
width. 

The code is stored in 32 bit width words in main memory and transferred to a wider 
instruction cache (called the instruction buffer) prior to actual execution. The instruction 
cache has a certain capacity to allow particular code loops to remain cached without 
continuous access to main memory required. The wider instruction buffer can be 
configured in size to support power consumption and area goals. 
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The minimum width is 32 bits and any lamer width that is a multiple of a bvte mav be 
selected. An artificial limit of 512 bits is imposed so that the upper bound of support for tS 
.nstrucbon control unit is known. The instruction control unit feresponsibte S2£ * 

insrctioUXr" ^ fomiin9 int ° WOrds to be «oade5 intone 

CriticalBlue has many commonalities with a Transport Triggered Architecture (TTA) The 
bits wrth.n the execution word largely specify information regarding the transport of data 
between functional units. The instruction word is not divided intc opcode St?r and 
other fields as is normally with the case with RISC style processors. 

All bits within the execution word have positional context. There is a direct relation<;hin 

^S^S^S^^ fU n? i0na ' UnitS that *~* COntro '- ™* ^-5E?5E 
execution word decoding task. The appropriate bits for a particular functional unit are 
s.mply routed from the execution unit word as required. functional unit are 

The actual layout of the execution word is performed automatically from analysis of code 
This analysis exammes the frequency and size of the control bits for particuter foncS 
units and organises the instruction word to maximise parallelism. part,cu,ar functional 

The basic structure of the execution word is illustrated below: 



| Group A Group B 


Group C |GroupD| Group E | 







k rrrr is then used to indicate -* 

Iel^L m6 S hani T a .I l0W ! ? SCt trade -° ff between code ^nsity and parallelism (i.e. 
performance) independently of the functional unit selections. A narrow execution word 
forces more functional units to share groups. If more than one of those SbSddto 

Howllr" 3 Part ' CUlar Cyd ? **" t*« oma ™* ™y ^ lost as only one may be Sected 
»mn^ ; T f xacut,on word increases the chances that all groups are usefully 
employed on each clock cycle and thus code density is improved. If a wider execuS 
word is employed then greater parallelism is possible (as there is less shaZg *wSn each 
group) but groups are more likely to go unused on any particular clock cycle 

JEHUS ? *| execu " on S* fe used for controlling functional units apart from one bit. 
Son his SSffilS" 0 ( ^ fe ^ t0 indiCate 016 ' aSt 6XeCUtion word in a 

5.4 Functional Units 

1!^^°"?' UPitS reDrese u nt the building blocks of the processor. The selection of 
functional units represen s the basic configurability of the architecture. The designer is 

rriL? R i^ f , Ct .'° na l Un , itS 38 reqUired for a P articular aPP'ioation domain. The 
Cntica Blue toote automatically create the required connections between the functional 

Sid ? a . Tn.? em . m \ constituant components of a fully programmable processor. 
Ind.v.dual functional unrts may be replicated as required and the software tools 
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5.4.1 Overview 

n^l e mr 4thin - th ! ftjnctional un « is the execution unit. This is the block that actuallv 
performs the requ.red operations. The execution unit is surrounded by addMona? tafettS 
allows the execution unit to be controlled by software as part of a^EwE^^ 
communicate with other functional units. All inputs to the execution unit aeZtect^L™ 

IJES?^™^ ? d8ta bUS6S - TheS ^ buses communicate Lte b ^n me 
ri^ini i ¥ ? 1 ?J he P rocessor - ^ CriticalBlue optimisation tools ZtaScaDv 
SSTl?? and TOnnecBvit y of these buses This is done on Te basifof 

dataflow in the required applications. Outputs from the execution unit are latched and Ln 

tZ^IT a J 3ta bUS f ° r USe by another fUnctional EacrSional untt is SS 
worn nf L ^ S ° me °°T* '° 9ic t0 al,OW *» unit to be controlled from the execulon 

»ttmj.rffp^ is extracted * at se,ects ^ 

5.4.2 Constituent Blocks 

?agram: StitUent b '° CkS ° f 8 ^t^' Unit are shown in more *** in the following 
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T rt 18 thS US6r defined portion of a Actional unit. For implicit functional 
. eXe ? Ut '2" Unit """P"** fe mailable in the CriticalBlue I bZ For explfcft 
functjonal units this component has to be supplied by the user The surJouSnn 

SSSSS* ^ ^ t0 33 9 ' Ue UnitS - ^ « - ^"braS blocks in Se 
£S£?S env,ronment The Critica,Blu e tools automatically generate Ihe SnnectiviS 
between the execution units, glue units and the external environment This^oresente 
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As shown in the diagram, a functional unit has a number of external connections. On the 
left hand side are data inputs to the unit. These represent data values to be supplied to 
the execution unit. Operand selector glue units are used, under software control, to select 
data from one of a number of potential sources. Dedicated inputs allow data to be 
selected from specific sources in the architecture. These are used to connect functional 
units to control units and to enable private connections between different functional units. 

On the right hand side are the outputs from the functional unit. These are the results 
generated by the execution unit. Such outputs are directed through a number of glue 
units. Firstly, the result selectors allow data to be selected either from the execution unit or 
directly from one of the operand inputs. This allows functional units to be able to copy 
input operands to output results directly. This copy functionality is required in order to 
distribute data values around the functional unit array. The next stage is the optional run- 
out glue unit. This is only required for units with multiple cycle latencies. A single unit is 
used to capture data values as the execution unit generates them so that state is not lost 
during the transfer to an interrupt handler or if the whole pipeline is stalled. Finally, the 
output banks include registers and drive the data to other functional units. 

The inputs along the bottom of the diagram represent control inputs. Firstly a subset of the 
execution word is presented to each functional unit. This is used to control its operation 
and select data operands appropriately. The Strand Enable Mask (SEM) indicates which 
strands are currently being executed and allows operations from disabled strands to be 
ignored. A wait flag is globally distributed to functional units. If asserted it causes the 
entire pipeline to be stalled. A test port allows production test vectors to be performed on 
the unit. Finally, various control flags may be distributed to each functional unit. The enter 
and exit interrupt flags indicate when the processor is starting or ending an interrupt 
response. This allows appropriate synchronization of the output registers, which need to 
maintain previous functional unit state during the execution of an interrupt handler. 

5.4.3 Architecture Overview 

A detailed architecture overview of a functional unit is shown below. This shows the 
internal connectivity between the constituent blocks. 
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The diagram shows an execution unit with two operand ports and one output port 
However, the number of both input and output ports is completely configurable. 
The connectivity for the constituent blocks is as follows: 

□ Execution Unit: Receives operand data values from the operand selectors In 
general a particular operand will have multiple potential data sources. However if 
only one data source is required then an operand port may be directly connected 
to the external data bus, avoiding the need for an operand selector. A method 
selection is obtained from the controller unit. This is extracted directly from the 
execution word but is delayed by one clock cycle by the controller so its 
presentation is in synchronization with the associated operands. The select flaq is 

. 1 l eXe . CUti ,?? u Unit has ^ se,ected to P erform a ^w operation 
" K ^ 0< r k ^ cle - , f «» fla 9 's false then the method and operand inputs are 
undefined. Output results are connected directly to the result selector, run-out and 
output bank data path. 

□ Controller: The controller reads the opcode portion of the execution word and 
compares the code against the fixed selection code for the unit. If there is a match 
then the unit is being selected. The Strand Enable Mask (SEM) is shows the 
status of each strand. The strand to which the operation belongs is specified as 
part of the execution word. An operation is only performed if that strand particular 
strand is active. Finally, the wait flag is used to indicate a pipeline stall and is used 
to prevent further operations being issued to the unit. 

□ Operand Selector: An operand selector is a simply a multiplexer for selecting 
one of a number of data inputs from data buses. The selected bus is passed to an 
operand input of the execution unit. A portion of bits from the execution word is 
used to specify the bus to be selected. These bits are latched so that they are 
delayed by one dock cycle. This causes the data steering to be performed during 
the execution cycle rather than the decode cycle. 

□ Delay Pipeline: The delay pipeline simply delays a number of data and control 
signals by a number of clock cycles. The delay period is one less than the latency 
of the exection unit. Thus if the execution unit has a latency of one then no delay 
pipeline is required. The input operands are delayed so that they are available to 
the result selector during the correct clock cycle. The result selector is itself 
delayed in the unit. The output mask field is delayed so that it is available to 
control the output banks during the correct clock cycle. The selection output is 
delayed and is used to indicate if valid output is being generated by the execution 
unit dunng a particular clock cycle. The control pipeline also generates an optional 
output used to gate the clock to the execution unit when it is not performing any 
operations. ** ' 

□ Result Selector: The result selector is simpy a multiplexer used to select data 
from either the result port of the execution unit or from one of the operand inputs 
Under normal operation data is selected from the execution unit. If the unit is 
being used to perform a copy operation then the execution unit itself is not 
activated and data is selected from one of the operands. This allows the unit to 
copy its input from any of its inputs in order to distributed data values throuqhout 
the functional unit array as required. 

□ Run-Out: This unit is only required if the execution unit has a latency of greater 
than one and interrupts or pipeline stalls are supported in the architecture 
Normally the Run-Out requires an additional clock cycle itself so the effective 
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latency of the functional unit is increased accordingly. The Run-Out unit is an 
intelligient FIFO that is able to capture data generated by the execution unit. 
Normally data is passed straight through the unit (with a one clock cycle delay). 
However, if there is an interrupt or pipeline stall then the Run-Out is able to 
capture the results of calculations already in-flight within the execution unit. Upon 
restart after the interrupt or pipeline stall it is then able to replay those results. This 
effectively allows functional units to be stalled or have their internal state captured 
without have to apply clock gating or provide state capture within the execution 
units themselves. 

□ Output Bank: The output bank is a local register file. The number of registers is 
equal to the number of output connections from the unit. As the execution unit 
generates data one or more of the registers in the Output Bank latches it. An 
output register mask is included in the execution word and specifies which 
registers should be latched for any given operation. New data from the execution 
is only latched if it is producing a valid output during that cycle. This is determined 
by using the select signal from the delay pipeline. Thus data remains in the output 
register until explicitly overwritten by subsequent operations performed on a 
functional unit. 



5.5 Controller Glue Unit 

The controller glue unit is responsible for generating the unit select signal, the result 
selector and the output register mask. The unit select signal is asserted if a new operation 
is being initiated on the execution unit during the cycle. The output register mask is used 
to control the latching of new data in the registers in the output banks. 

There are two distinct types of controller unit depending upon whether execution is 
conditional on a particular strand. If an execution unit has no side effects then it can be 
executed unconditionally as a speculative use of the unit will not have any permanent side 
effects. Units that result in side effects (such as any unit with internal state such as 
register file or memory unit) must always be executed conditionally. Unconditional units 
can have a more compact representation than conditional units as no strand number 
needs to be specified as part of the execution word. 

5.5.1 Conditional Controller 

The diagram below shows the internal architecture and connectivity for a conditional 
controller glue unit: 
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A conditional controller is used when the execution unit has internal state so that certain 
methods cannot be executed speculatively. 

The external connections for a the controller are as follows: 

□ SEM (Strand Enable Mask): This is a bit vector showing which strands are 
currently enabled (represented by a set bit). It is used to determine whether a 
particular operation should be performed or not, depending on which strand the 
operation belongs to. 



□ Strand/Result Field: This field specifies the strand number that the operation 
belongs to. It is obtained from a group of contiguous bits in the execution word. 
The width of the field is 4 bits to allow the use of up to 16 strands in a particular 
region. If the unit is being used to perform a copy operation then this field selects 
which of the input operands is being copied to the output. Copy operations are 
always performed unconditionally so no strand number needs to be specified. 

□ Opcode Field: This is a contiguous blocks of bits from the execution word that 
represents the currently selected operation code form the group used to control 
the unit. This is compared against a unique fixed selection code. If they match 
then the unit is selected during the cycle. All controllers have a fixed width to the 
opcode field that represents that maximum size of an opcode. In general the 
opcode field will be narrower and the unused bits are tied to 0. 

□ Fixed Selection Code: This is a fixed literal value that represents the unique 
selection code allocated to the unit. This allows the unit to be selected from other 
units that may share the same opcode and control bits from the execution word. 

□ Output Register Mask: This is a bit mask showing which of the registers in 
output banks of the functional unit should be updated with the results of an 
operation. These bits are obtained from a contiguous block from the execution 
word. If there is only one output bank with one output register then this field is not 
required as it is assumed that the register is always written. The output register 
mask allows arbitrary sets of registers to be updated by a particular operation. 
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Thus registers holding live data values from previous operations can be preserved 
without completely blocking usage of the unit. Each controller library block will 
have a output register mask field of a fixed size. If the field required is actually 
narrower then some bits will be unused. 

□ Copy Flag: This flag is obtained from the execution unit. If it is set then the unit 
should be used to perform a copy operation. A particular input operand is copied 
without modification to the output 

□ Method Held: This is the contiguous block of bits from the execution word that 
specifies the number of the method to be selected. If an immediate field is also 
present then this is also included. This field is latched for presentation to the 
execution unit during the execution cycle. This field may not be present if the 
execution unit only supports a single method and there is no immediate field. 
Each controller library block will have a method field of a fixed size. If the method 
field is actually narrower than that size the some bits will be unused. 

□ Wait Rag: If this signal is asserted then the pipeline is being stalled. No new 
operations should be initiated on the execution unit, independently of the opcode 
being presented. 

□ Unit Select: This is signal is set during the execution cycle if a new operation is 
being issued to the functional unit. 

Conditional controller glue units are parameterised by two distinct parameters. These are 
represented as part of the identifier for each unit. These parameters are: 

□ The width of the method field latch. If necessary a controller with a wider latch 
than required may be used. Additional unused bits are simply tied to 0. 

□ The width of the output register mask latch. If necessary a controller with a wider 
latch than required may be used. Additional unused bits are simply tied to 0. 

5.5.2 Unconditional Controller 

The diagram below shows the internal architecture of an unconditional controller glue unit: 
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An unconditional controller is used when the execution unit has no internal state so all 
methods can be performed speculatively. 

All inputs to and outputs from the immediate controller unit are as a conditional controller 
except that no SEM input or strand number is required. The method and result selector 
share the same field. If a copy operation is being performed then no method needs to be 
specified. The number of bits required to specify an operand number determines the 
minimum width for this combined field. 



Unconditional controllers are parameterised by the same parameters as conditional 
controllers. 



5.6 Operand Selector Glue Unit 

The diagram below shows the basic structure of an operand selector unit: 
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An operand selector is simply a multiplexer that directs the contents of a particular data 
bus to the output operand. The output operand is then. fed to the input of an execution 
unit. The switching of the multiplexer is performed during the first execution cycle of the 
execution unit. The multiplexer is controlled directly by specific bits in the instruction word. 
These are distributed during the decode cycle and are held in the latch until the first 
execution cycle. 

There may be a number of operand selector units associated with each execution unit in 
order to supply data to all of its operands. Some execution unit may be directly connected 
to a specific bus so no operand selector is required. 

Operand selectors are parameterized by two attributes. 

□ The width of the data buses. Support is provided for sizes between 1 bit and 32 
bits. If a particular width is not available then the next higher width may be used 
and the upper bits remain unconnected. 

□ The number of selection sources. This has a direct impact on the number of 
control bits required. Again, if a particular size if not available then the next biggest 
size may be used with unused sources connected to 0. 

5.7 Delay Pipeline Glue Unit 

The delay pipeline simply delays a number of parameters so that they are available on the 
clock cycle during which the results from the execution unit are generated. The delay 
pipeline is only required if the latency of the execution unit is greater than one clock. The 
delay required in the pipeline is one less than the latency of the execution unit. 

The delay pipeline is parameterised by the number of clock cycles of delay and by its 
width in bits. Different delay controllers are made available in the library for the various 
possible latencies and data widths. A wider data width can be used than is required and 
the unused bits simply tied to 0. 
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The architecture of the delay pipeline is shown below: 
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The connections are as follows: 

□ Operand Data: This is the selected data from each of the operands to the 
functional unit. These are delayed so that they may be presented to the result 
selector on the same clock cycle that the results are available from the execution 
unit. The width of the operands is the minimum of the operand width and the 
widest result port width. 

□ Result Selector: This is the selection value for result selector multiplexers). If it is 
0 then the result from the execution unit is selected, otherwise a particular delayed 
operand value is selected. 

□ Output Register Mask: This is the bit vector of registers that should be updated 
when the result is available from the execution unit. It is delayed so that it is 
available during the same clock cycle as the results. All bits are masked to 0 if no 
operation was selected for the unit during the cycle. 

5.8 Result Selector Glue Unit 

The diagram below shows the internal architecture of the result selector: 
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The result selector is simply a multiplexer that selects a data value from one of a number 
of inputs. The result from the execution unit is always presented as the input selected by a 
0 oh the result selector. Each of the other inputs is obtained from one of the operands to 
the execution unit. This allows any operand input to be copied to the result. The operand 
inputs will be delayed through the delay pipeline if the execution unit has multiple cycles of 
latency. If a particular execution unit has no input operands then the result selector is not 
required. 

The critical path will generally be through the results from the execution unit. Thus the 
multiplexer may be implemented to provide a faster path for that input than the other 
selections from the operands. 

The number of selections for the multiplexer and the width of the buses parameterise the 
result selector. Different result selectors are made available in the library for different 
numbers of operands and widths. If the exact selector is not available then one that is 
wider and/or allows more selections can be used and unused bits tied to 0. 



5.9 Run-Out Glue Unit 

The diagram below shows the internal architecture of a run-out glue unit- 
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A single run-out unit is present in the functional unit if the latency of the execution unit is 
more than one clock cycle and pipeline stalls and/or interrupts are supported by the 
architecture. The run-out captures data from all output ports from the execution unit and 
the output register mask. It then passes them on to the output bank units with a one clock 
cycle delay. If a pipeline stall or interrupt occurs the unit is able to store the data in an 
internal FIFO until the contents of the execution unit has been flushed. The length of the 
run-out is the same as the latency of the execution unit. The stored data can then be 
replayed as the execution is re-started. This mechanism allows the full internal state of a 
multi-cycle execution unit to be stored without the requirement for any internal state 
saving within the design. Moreover, it effectively allows the execution unit to be stalled on 
any cycle but without the requirement to provide clock gating or a stall input to the 
execution unit itself. 
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The maximum number of entries it can hold and the width of each entry parameterise the 
run-out unit. The library contains run-outs of different widths and depth. If a run-out of the 
exact width is not available then a wider one may be used. The unused bits are simply 
tied to 0. 



The inputs to and outputs from the unit are as follows: 

□ Data Inputs: The input data from the result selectors (or directly from the 
execution unit if no result selectors are present). All output results are stored in the 
same run-out unit. 



□ Data Outputs: The output data from the run-out. This is passed to the output 
bank units. In general this will simply be the input data delayed by one clock cycle. 
However, if data is being flushed from the run-out then this will be previously 
captured data. 

□ Register Mask Input: This is a bit vector showing which output registers should 
be latched with the corresponding data input generated during the cycle. All bits 
will be 0 if no valid data is being generated. This mask is stored alongside the 
data in the run-out so that it can be replayed in synchronisation with the data itself. 

□ Register Mask Output: This is the output register mask that is passed to the 
output bank units. In general this will simply be the input mask delayed by one 
clock cycle. However, if data is being flushed from the run-out then this will be 
previously captured masks. They are stored alongside the data itself so that they 
are replayed in synchronisation. 

□ Enter Interrupt: This signal is asserted if the processor is able to start responding 
to an interrupt. This causes the state of the main register to be copied to the 
shadow register. 

□ Exit Interrupt: This signal is asserted if the processor is returning from an 
interrupt response. This causes the shadow register state to be copied back into 
the main register. 

□ Wait Flag: If this flag is asserted then this means that the entire pipeline is being 
stalled. Outputs are stored in the order within the run-out until the wait is either de- 
asserted or the run-out becomes filled. When the wait flag is de-asserted the 
stored values are then flushed out of the run-out. 



The unit is composed of three main internal blocks: 

□ Control Logic: The control logic is responsible for the passage of data through 
the run-out. It determines the current queue head position and directs new data to 
the appropriate position in the run-out register file. The control logic also modifies 
its queuing behaviour as pipeline stall and interrupt entry/exit events occur. 

□ Main Run-Out: This is a register queue for holding the outputs from the execution 
unit. The number of entries is equal to the latency of the execution unit so that the 
whole execution unit pipeline can be flushed into the run-out in order to preserve 
its state. When the queue is empty new results are passed immediately to entry 0 
at the head of the queue. Entry 0 drives the data output from the unit. Entry 0 is 
handled specially so that it is not overwritten until the execution unit produces new 
valid data. Each run-out entry holds both a data item and the valid data flag 



25 



CriticalBlue Microarci^Jture Description 

derived from the unit select. Entries not containing valid data are queued along 
with valid entries to maintain the correct timing. 

□ Shadow Run-Out: This is a register queue of the same capacity as the main run- 
out. It is used to hold execution unit state if an interrupt handler is entered. The 
main run-out state can be transferred in one clock cycle to the shadow run-out. 
The shadow run-out can be subsequently transferred back to the main run-out in 
a single clock cycle. 

The following is a description of the operations performed surrounding interrupt and 
pipeline stall events. Note that an interrupt cannot be entered/exited while there is a 
pipeline stall condition. However, all other combinations of events are legal. 

5.9.1 Entering Interrupt 

Upon entry to an interrupt the state of the execution unit pipeline must be preserved. This 
means that any operations that have already entered the pipeline at the point of the 
interrupt entry are considered to be part main execution statQ. Thus all results generated 
by those operations must be preserved. Once the enter interrupt is received all outputs 
from the execution unit are queued in the main run-out. After a number of cycles 
corresponding to the latency of the unit all operations issued prior to the interrupt will have 
been cleared out of the execution pipeline. At that point the contents of the main run-out is 
copied to the shadow run-out as a single atomic operation. TTius the results from the pre- 
interrupt operations are preserved. 

5.9.2 Exiting Interrupt 

Upon exit from an interrupt the state preserved in the shadow run-out must be fed back 
into the execution pipeline again. Upon an exit from interrupt signal being received the 
shadow run-out is copied back to the main run-out as an atomic operation. New 
operations may then be issued to the execution unit. However, it takes the latency of the 
execution unit before those operations generated new results. During this time the results 
preserved before the interrupt are flushed out of the main run-out. New results are ready 
from the execution unit during the cycle that the run-out is fully flushed. 

5.9.3 Pipeline Stall 

The CriticalBlue architecture does not require the use of clock gating for execution units 
(gating may optionally be utilised as a power saving technique). Execution units do not 
have to provide any pipeline stall method. However, a functional unit that has asserted a 
wait signal may stall the entire CriticalBlue processor pipeline. Thus mechanisms external 
to the execution unit itself must manage the pipeline stall. If a stall occurs then the unit 
select for the execution is masked off so no new operations can enter the pipeline. 
However, the results of operations that have already entered the pipeline must be 
preserved. They are queued in the main run-out. The maximum queue length does not 
need to exceed the latency of the unit (even if the stall period is longer) since the 
execution pipeline will have been fully flushed after the latency period. 

5.9.4 Pipeline Restart After Stall 

After a pipeline stall, queued entries from the run-out are presented as the output data 
until all queued entries have been flushed. Thus as the pipeline is restarted the previously 
preserved results are made available as though they had just been generated by the 
execution unit. Thus, externally, the output is equivalent to that which would be observed 
if the actual execution unit pipeline itself had been stalled. 
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5.10 Output Bank Glue Unit 

The output bank glue units hold results from a particular result port of an execution unit. 
An output bank glue unit may drive a number of connection buses to the operands of 
other functional units. Each bus has an associated register. The output register mask (that 
is specified as part of the execution word) determines which particular registers are 
updated by an operation executed on the unit. 

Each output register also contains internal storage for maintaining the output state if an 
interrupt occurs. An interrupt response is immediate so the state of the functional units 
must be preserved simultaneously. This state is later restored on return from the interrupt 
handler so that execution can continue from the point of interruption. Nested interrupts are 
not permitted so only one level of storage is required. 

The architecture of the output bank glue unit is shown below: 
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The result data is made available to all of the individual output registers. Any number of 
these output registers can latch the data available on this bus depending on the state of 
the register mask. Each bit of the register mask controls the latching of an individual 
register. 

Output bank units are parameterised by two attributes: 

□ The number of individual output registers required. This corresponds to the 
number of bus connections that are driven by the unit. 

□ The width of the output result. If necessary an output unit wider than that required 
can be used. The unused bits are tied to 0. 

The diagram below shows the internal architecture of an individual output register unit: 
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Under normal operation the execution unit data is latched into the main register. This 
ensures that the data is constant for the duration of the following cycle to be driven to 
other functional units, which may read it as operands data. Secondary storage is provided 
in the form of a shadow register. This is used to hold the previous state of the main 
register if an interrupt response is initiated. Thus the execution unit state is preserved and 
can be restored on return from the interrupt. 

The meanings of the various inputs to and outputs from the unit are as follows: 

□ Data Input: The result data as distributed throughout the output bank unit. 

□ Data Output: The latched result data that is driven as a point-to-point connection 
to an operand input of a functional unit. The output must have sufficient drive 
strength to drive the connecting bus. The connectivity algorithms ensure that a 
maximum bus length is not exceeded. The data is driven one clock cycle later 
than the output from the execution unit. 

□ Output Mask Bit: This is an individual bit from the register output mask, it 
Indicates that the output from the execution unit should be latched by the register. 
If no new value should be latched then the current state is latched back into the 
register. 

□ Enter Interrupt: This signal is asserted if the processor is able to start responding 
to an interrupt. It causes the state of the main register to be copied to the shadow 
register. 

□ Exit Interrupt: This signal that is asserted if the processor is returning from an 
interrupt response. It causes the shadow register state to be copied back into the 
main register. 
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Test Port: This is special serial port that forms part of the debug chain in the 
processor. It allows the output register to be read and modified via a single serial 
port connected to all functional units. Each output register in the processor is 
connected to the debug port and is a unique address so that the debug port can 
address it. 



6 Pipeline Timing 

This section describes the cycle level timing of various activities in the CriticalBIue 
processor pipeline. The architecture is characterized by having a short and highly regular 
control pipeline but with the flexibility to allow functional units with arbitrary length internal 
pipelines to be included in the processor. 

6.1 Overview 

Like all contemporary processor architectures, CriticalBIue is pipelined. This allows it to 
extract higher levels of performance by overlapping certain operations. All aspects of the 
pipelining are under user control. Due to the partitioned nature of the control flow paths in 
comparison to the data paths, they have separate pipelines. 

The control path uses a very simple three stage pipeline reminiscent of early RISC 
architectures. The three stages are fetch, decode and execute. During the fetch stage the 
next execution word is read form the instruction buffer. During the decode stage the 
execution word is distributed to the functional units and the appropriate segments are 
decoded by the units. Finally, during the execute cycle the operations are presented and 
initiated in the appropriate functional units. 

Each of the functional units has its own, independent, pipeline controlled from a master 
clock. The length of the. execution pipeline for each unit is specified in the execution unit 
model as its latency. The code generator automatically takes account of the length of 
execution unit pipelines in the management of the data flow between functional units. The 
independent specification of the pipeline length for each execution unit allows great 
flexibility in the construction of the individual units. Each functional unit can generate a 
wait signal. If this is asserted then the entire pipeline of the processor is stalled. This 
allows the implementation of execution units that sometimes require an extended latency 
period. For instance, it can be used for cache memory units where the latency is longer if 
a particular data item is not present in the cache. 

The short pipeline allows the branch that occurs at the end of a region to occur without 
any pipeline bubbles. The last execution of region can occur back-to-back with the first 
execution cycle of the succeeding one. 

6.2 Instruction Timing 

The diagram below illustrates the timing of instructions executed in a CriticalBIue 
processor. 
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During the fetch cycle the EWA (Execution Word Address) is used to address the 
instruction buffer and obtain an execution word. During the decode cycle the appropriate 
bits from the word are distributed to all the functional units in the system. Each of these 
has comparison logic embedded within the controller glue unit to determine if an operation 
for that unit has been selected. If so then an operation on functional unit is initiated. All 
functional units have at least one execute cycle. The data buses distribute results from 
functional units to the inputs of other functional units. Each functional unit has a user 
defined latency. The pipelines of the functional units run independently and do not affect 
the timing of instruction fetching and decoding. 

In a typical RISC pipeline all functional unit pipelines must be completed by a write back 
of results into a centralised register file. Thus the individual functional unit pipelines are 
intimately tied into the overall control pipeline of the processor as appropriate feed forward 
paths must be managed to feed data from register writes to subsequent register reads. 
Since a CriticalBlue processor uses software to manage access to register files (they are 
treated like any other functional unit) the functional unit pipelines can be effectively 
separated from the overall fetch and decode pipeline. The code scheduling manages the 
data bus resources and ensures results are only read when they are available from the 
outputs from functional units. 



6.3 Data Flow Timing 

The diagram below illustrates the data flow timing of functional units in a CriticalBlue 
system. It shows a particular dynamic data path through the functional units as results are 
passed from one unit to the next. 
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The grey areas represent the data bus connectivity network in the processor The blue 
line represents the path of a particular data item as it is scheduled over those buses The 
initial result is produced from a single cycle functional unit. The result is calculated during 
cycle 1. It is latched by the output register in the unit at the end of cycle 1 and then driven 
onto the output bus during cycle 2. At the start of cycle 2 the result is steered into a two- 
cycle lunctional unit operand. It is operated upon during the remainder of that cycle and 
dunng cycle 3. At the end of cycle 3 it is latched into the output register and the result 
dnven during cycle 4. It is then steered into another single cycle execution unit. Finally it is 
held in the output register for an extra cycle while other operands for a subsequent 
operation become available. During cycle 6 the data item is read into a memory unit. 

6.4 Squash-Squashed Timing 

The diagram below shows the timing from the issuing of a squash operation and the 
selected strands actually being squashed. 
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The first execution cycle of the squash operation reads the squash status operand. This 
may have been calculated by an arithmetic operation in a previous clock cycle. During the 
first execution cycle the squash vector is created and distributed to the strand control unit. 
During the next cycle (effectively the second execution cycle of squash) the strand control 
unit uses the squash vector to update the current SPM and thus the value of SEM. This 
SEM is driven out of the strand control unit to all functional units during the cycle. Thus it 
is available in the decode cycle for all lunctional units in the following cycle. This updated 
SEM determines whether operations from particular strands will actually be executed. 

Thus there is a latency of 3 for squash operations. There must be two clock cycles 
between a squash operation and the first operation from a potentially squashed strand 
that is guaranteed disabled if the squash was active. 

6.5 Squash-Commit Timing 

This diagram shows the timing between the issue of a squash operation and the issue of 
a commit operation for a strand that may have been squashed. 
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A timing relationship exists between a squash and the following commit for an affected 
strand because the predicate state of a strand must be known at the point of commit. 
Thus if a strand is to be squashed it be so by the commit point 

During the first execution cycle of a squash operation the squash condition is calculated 
and distributed in the form of a squash vector to the strand control unit. During the 
following cycle the strand control unit calculates an updated SEM (as determined by the 
updated SPM). This is then available during the first execute cycle of a later commit 
operation. The updated SEM is used by the branch control unit for the CSM calculation. 
This updated CSM will include the final predicate state for the strand. This avoids the 
strand being marked as committed in CSM and being subsequently squashed. 

Thus there is a latency of 2 clock cycles between a squash and subsequent commit. 
There must be a filler instruction between them. 



6«6 GoBiiiiiit-Gomrtist Timing 

The diagram shows the relationship between two commit operations. Note that this timing 
relationship is only required if the first strand being committed contains a branch and the 
second commit is for a higher numbered strand. In other cases commit operations may be 
issued on subsequent cycles, or even on the same cycle if multiple commit units are 
supported in the architecture. 



33 



CriticalBlue 



Microarchitecture Description 



Clock Cycles 



Instruction (containing 
first commit operation) 



: : : : 




| Fetch 


Decode 


Execute 1 

L s 1 


Execute 2 


Execute 3 
, ^ 



— - — — — — i — pi i j^ig^ucu cu 

control unit and new CSM calculated strands squashed 



Following Instruction 



Following Instruction 



Following Instruction 
(containing second 
commit operation) 



| Fetch 


Decode 


Execute 


At 








Fetch 


Decode 


Execute 



New SHV| available for 

calculating new CSM for 
second commit 



Fetch 



Updated SEM 
calculated 



i 



Decode Execute 1 



Note that this timing is only required if the first commit is for a strand containing 
a branch. If there is no branch then commits may occur on successive cycles. 



The updated CSM is calculated during the first execute cycle of the commit in the branch 
control unit During the second execute cycle the updated CSM is available and is used to 
resolve the current branch status, The active branch is selected from the branch registers 
and loaded into the NRA. A squash vector is also generated by the branch control unit 
and distributed to the strand control unit. This squashes all strands higher than the taken 
branch. During the next cycle an updated SEM is calculated in the strand control unit to 
account for the squashes. The updated SEM is then available on the next cycle for 
determining an updated CSM fpr a subsequent commit. 

Thus there is a latency of 3 cycles between a commit (for a branch containing strand) and 
a subsequent commit. There must be two filler instructions between them. This 
requirement stems from the fact that a commit operation can itself cause squashes of 
higher numbered strands if a branch from the committing strand is selected. Thus 
sufficient cycles must be left to allow the squash to propagate through the system so that 
the final squash state of the subsequent strand is known prior to its commit. 

6.7 Commit-Succession Timing 

The diagram below shows the timing relationship between the commit of the final strand 
containing a branch and the region succession. If the strand being committed does not 
contain a branch then the commit in the last few cycles of the region (even in the last 
instruction word if there are no actual committed operations in the strand). 
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A new CSM is calculated in the branch control unit during the first execute cycle of the 
commit operation. Any branch from the strand is then selected during the second execute 
cycle of the commit operation. This is stored in the NRA that is distributed to the 
instruction control unit during the cycle. This address is then available to look up the 
instruction buffer address during the next cycle. This looked up address can then be used 
to perform the first fetch on the following region. 

Thus there must be at least 4 instruction words in the region following the commit of the 
final strand that contains a branch. 



6.8 Region Succession Mechanism 

The control mechanism for performing a region succession is illustrated in the diagram 
below. Such a succession occurs when the end of a region is reached. It may also occur 
when entering an interrupt handler, in which case the ERF flag is generated by the 
interrupt control logic. 
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The destination address is determined prior to the end of the region. A sufficient number 
of clock cycles are left between the resolution of the last potential branch in the region and 
the last instruction word in the region (in which the ERF flag is set). This will leave a new 
instruction buffer address available that has been looked up from the region cache in the 
instruction control unit. 



The mechanism allows a flag to be set on the last instruction of a region (ERF) and to 
immediately initiate a succession so that the first instruction from the new region can be 
executed without any further latency. 

The End Region Flag (ERF) is used to direct a multiplexer that selects the next execution 
address from either the Execution Word Address (EWA) or the new address. This 
selection can be performed during the same cycle as the access itself, thus allowing very 
quick address steering. The EWA is incremented by one on each cycle so that execution 
is advanced through the region. Thus the second instruction of the new region is executed 
as EWA is loaded with the new address plus one. 

If the successive region is not present in the instruction buffer then no new address can 
be calculated. In that case the instruction control unit arranges the transfer of the required 
region into the instruction buffer before execution commences. The transfer starts in the 
cycle immediately following end of the previous region. 



7 Control Units 

7.1 Instruction Buffer Unit 

The instruction buffer unit is simply a memory block used for holding instructions being 
executed by a CriticalBlue processor. The width of the memory is determined by the 
instruction width of the processor. The instruction width is a configurable attribute of the 
processor and any width of 32 bits of greater that is a multiple of 8 bits may be chosen. 
There are trade-offs between code density and potential parallelism for different 
instruction widths. 
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The instruction buffer must have a latency of one clock cycle. Data read from the buffer is 
distributed to the individual functional units in order to provide processor control. The 
instruction control unit generates the read addresses. The instruction control unit also 
provides all the memory control signals and data to be written into the instruction buffer. 

The instruction buffer may be loaded with a fixed block of code upon initialization 
(uncached) or may be updated dynamically as code is being executed (cached). The 
instruction control unit handles all the required control. For uncached usage the size of the 
instruction buffer memory represents the total code storage space. For cached usage the 
size represents the size of the working set of code that may be held and different sizes 
impact performance. 

7J2 Instruction Control Unit 

This unit generates the current instruction word address used to access the code 
memory. This address is incremented during each clock cycle in order to execute code 
sequentially. When the fetch unit detects a flag indicating that the end of the region has 
been reached it updates the instruction address with the NRA generated by the branch 
control unit. 

Two different variants of the instruction control unit are available depending upon whether 
an uncached or cached arrangement of the instruction buffer is being used. The caching 
arrangement employed does not change the structure of the instruction buffer itself. 

In an uncached arrangement the architecture forms a fully Harvard architecture. The 
instruction buffer is loaded with, fixed code upon initialisation and it is not varied during 
execution. This is suitable for CriticalBlue processors running smaller applications where 
the whole program can be sensibly loaded into an. on-chip instruction buffer (usually 
implemented using SRAM). All destination addresses for branches directly specify 
locations in the instruction buffer. A disadvantage of this approach is that the original host 
code must be present in the main memory. It is required to allow the preservation of the 
original function entry and function return points. These are filled with link words pointing 
to equivalent code in the instruction buffer, allowing indirect function calls to be made and 
original return addresses. 

In a cached arrangement the instruction buffer is used to hold individual code regions that 
have been cached from main memory. Some regions may be locked permanently into the 
instruction buffer but the bulk of the space is used for holding region code that is being 
accessed on demand. A cached instruction buffer requires considerably more control 
logic in the instruction control unit. It must maintain a record of which individual regions 
are held in the buffer. Thus a mapping from the region address to its location in the 
instruction buffer is required. The control unit also requires logic to load regions from main 
memory into the instruction buffer on demand. The cached arrangement allows 
CriticalBlue executable code to be interspersed with the original code so that only certain 
key locations from the original code need to be maintained. This allows more efficient 
utilisation of the memory space. 

7.2.1 Execution Word Address (EWA) 

The EWA is used to hold the current address in the instruction buffer from execution 
words are being read. This is the equivalent of the Program Counter (PC) in more 
traditional architectures. The EWA is incremented by one on each clock cycle unless 
there is a pipeline stall. The EWA is loaded with a new region start address when the end 
of execution of the current region is reached. 
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7.2.2 Uncached Instruction Buffer Control 

The diagram below shows the instruction control logic for an uncached system: 
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As can be seen, the logic required for the uncached case is minimal. The EWA holds the 
current execution address and is incremented on each cycle. This is used* to address the 
instruction buffer. When the end of the current region is reached a new address is 
selected from the NRA. 

The instruction buffer is normally implemented using volatile memory such as SRAM. 
Thus it must be loaded with the program code on power up. This is achieved using the 
boot load unit. This obtains the code form an external source (perhaps an external serial 
ROM) and copies it into the instruction buffer. The EWA may be initially reset to 0 and the 
increment as required to load all the code. 

7.2.3 Cacheable Code Layout 

If a cached instruction buffer is used then the executable code appears in the main 
address map of the processor. The destinations that are supplied to branches are 
addressable in main memory (or shadow memory for breakpoint branches). The width of 
the main memory is fixed at 32 bits whereas the width of CriticalBlue instruction words is 
configurable and may be larger than 32 bits. The code is subdivided into 32 bit segments 
to be stored in main memory. This are reassembled into longer instruction widths upon 
loading to be written into the instruction buffer. 

Code and data need to be mixed in the address map. To allow CriticalBlue to execute a 
direct translation of original host code, all data addresses are preserved exactly. Some 
compilers intersperse constant data sections with executable code. Link words are also 
placed at the original addresses for all function return points and the entry address of all 
functions so that they may be called indirectly. Thus the original memory map will have 
many individual word addresses that must hold specific values and cannot be used for 
CriticalBlue code storage. 

In order to obtain good code density, usage must be made of the unused words between 
these fixed locations. These unused words will still represent the majority of words from 
the. executable image as the . original code can be overwritten. The placement of the 
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translated code in those unused words can be controlled by the code generation, as there 
is no requirement to place particular functions at particular locations. However, the 
CriticalBlue execution model requires the code for a whole region to be transferred from 
main memory to the instruction buffer before execution of it can commence. The sizes of 
regions are variable and there is not necessarily a good match to the blocks of unused 
space in the memory map. Thus attempting to allocate regions to contiguous unused 
blocks in the memory map leads to high levels of fragmentation and thus space 
inefficiency. 

To overcome this the representation of a region in memory allows some scattering across 
multiple blocks of unused memory. The. nriain memory to instruction buffer transfer 
process reforms the scatter blocks back into contiguous regions. 

The first word of any region in main memory is a control word. This is not part of the code 
to be loaded into the instruction buffer but simply indicates how the code is laid out in 
memory. The control word is composed of a number of bit fields that indicate the length of 
contiguous sections within the region. The sections flip between valid code and sections 
containing preserved values. 

The layout concept of the control word is illustrated in the diagram below: 
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Bit field a gives the number of words in the first bfock of code. Bit field b gives the number 
of preserved words that are not part of the code. Bit field c gives the number of words in 
the next block of code and so. This arrangement allows code to be spread over a highly 
fragmented address space with the minimum of addition complexity and control data 
overhead. 

7.2.4 Cached Instruction Buffer Control 

The control logic for a cached instruction buffer is considerably more complex. The 
instruction control unit must maintain a record of all regions that are held in the instruction 
buffer so that they can be reused if the same region is branched to again. A replacement 
scheme is also required to determine which region to evict if a new region must be loaded 
into the cache. The cache must ensure that the whole of a required region is brought into 
the instruction buffer and held in its entirety. Regions cannot be larger than the size of the 
instruction buffer. 
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The basic instruction buffer allocation scheme is illustrated in the following diagram: 
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Instruction buffer memory is used on a round-robin basis. The Current Buffer Pointer 
(CBP) register indicates the current address in the instruction buffer. This is the address at 
which any new region executed is written. The pointer wraps around when it reaches the 
end of instruction buffer (which will be a power of 2 in size). Thus the oldest region to be 
written will eventually be overwritten. If code execution remains within a sequence of 
regions that can all be held in the instruction buffer then no additional region loading from 
main memory is required. 

The Lowest Buffer Pointer (LBP) indicates the point that the CBP should be reset to when 
it reaches the end of the buffer. Thus memory between 0 and the LBP is "locked down" 
and is held permanently in the buffer. This is used for holding regions that are very critical 
to performance and for which ^fast and deterministic execution times are required. The 
code for handling fast interrupts is loaded upon initialisation into this locked area. 

The Buffer Address CAM (BAC) is held in the instruction control unit. It holds the mapping 
between region addresses and their actual locations in the instruction buffer. The BAC 
has a number of individual entries. The number of entries is determined so that, given 
average size regions, there are enough to hold mappings for all the regions in the 
instruction buffer. 

The allocation of entries in the BAC occurs on a round robin basis like the instruction 
buffer itself, the Current Entry Pointer (CEP) addresses the next entry to be used. Upon 
wrapping around the CEP is reset to the value of the Lowest Entry Pointer (LEP). Entries 
below the LEP are used for pointing to locked down regions. 



The structure of the instruction control logic required for a cached system is shown below: 



40 



CriticalBlue 

Next Region Address (NRA) 



Microarchitecture Description 



Buffer Address CAM (BAC) 



Compare 



Memory Address Buffer Address 



Compare 



Compare 



Memory Address Buffer Address 



Compare 



Compare 



Memory Address 



Address -» Compare |«- 



Region Loader 
Control Unit 




Memory 
Address 

.Memory 
r Control 

Instruction 
►Buffer Write 
Data 

. Data from 
main memory 

_ Data from 
shadow memory 



^Instruction 
Buffer Address 



This is considerably more complex than the logic required for an uncached 
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ffi^hJ^ lS . ° a ? n £ a neW region into the ^ruction buffer to clear any entries 
that are being overwntten by the new entry. 

When a new NRA is supplied during region execution it is looked up in the BAC This 
supplies the instruction buffer address that is held in the new address renter a £e 

When L n °l^H n the bUffer th9n thiS iS detected * the Re 9 ion Loadef Antral uSt 
When the end of the region is reached then the new address is selected for the 
addressing the instruction buffer. If the region is not present then the pipeline is stalted 
while the new region is loaded. The new buffer address is loaded into EWA via the 
increment unit Execution then continues through the instruction buffer until the end of the 
region is reached. 

Sl e hMSl° n it L ° ader 0t S rol Unit °? ntains 8,1 *» '°9 ic re ^ ired to load a region into 
toe buffer. It recedes the region base address from the NRA. It issues the address to 
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wntten to the instruction buffer at the address specified by the CBP. The CBP is then 

incremented. There will be one or more memory accesses for each instruction buffer 
access as the instruction buffer is wider. instruction Durrer 

ln!nn fn^ ° f *? T^f r « e mappin9 between me NRA and ^se ^ess of the new 
Z 9 1h th l'" St . ruc *' on buffer is ^en to BAC. The entry pointed to by the CEP is 
Knffl'rfh 6 2P ,S then ,ncrem ented. As instruction words are wntten to the instruction 
buffer the address is compared against all the instruction buffer addresses in the BAC If 
there is a match then the entry is cleared. Thus if the new region causes the full or partial 
overwnting of an existing entry then its mapping is cleared from the BAC 
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7.3 Test Unit 

The test unit allows control of testing for manufacturing test purposes. The test unit is only 
required in ASIC implementations of a CriticalBlue processor. FPGA implementations do 
not require it, as the underlying silicon of the FPGA will have already been fully tested. 

The test unit has an external JTAG interface that controls one or more scan chains. 
These scan chains connect to multiple units within the processor. These include: 

□ Control Units: Each of the control units has a scan chain for checking its 
operation. 

□ Glue Units: Glues units have scan chains for checking their operation. In 
particular, output registers and output run outs have a data loading capability via 
their scan chains. This allows data patterns to be loaded during test to check the 
interconnects between different functional units. 

□ Execution Units: Execution units themselves may have scan chains. The test 
unit only forms an interface to these. Their internal operation and the test patterns 
used are the responsibility of the execution unit designer. 



7 A Branch Control Unit 

The branch control unit determines which region will be executed next. The unit is able to 
handle multi-way branch conditions. A number of branch destinations with associated 
conditions may be issued in a region. The branch control unit determines which branch 
will be taken on the basis of which conditions evaluate to true and the relative priority of 
the branches. The unit also controls the initiation of handler functions for interrupts. 

The branch unit is used for all types of branches including calls, returns and the vectoring 
to interrupt handlers. 

7.4.1 Region Branch State (RBS) 

The RBS is a register that holds the current state for a destination branch selected from a 
region. The RBS is used to determine how certain other control registers (such as the 
SPM) should be updated when the end of the region is reached. 

The RBS has three states as listed below: 

□ Repeat: This is the default state that indicates that the current region should be 
re-executed. The region address is obtained from the RBA register. If the region 
should be repeated then any strands that would otherwise cause a branch are 
aborted and thus do not appear as committed strands in the CSM. If a repeat of 
the current region is performed then the SPM is updated prior to re-execution by 
squashed any non-aborted strands (so that they are not executed again). 

□ Branch: Indicates that the branch control unit has selected a branch destination. 
A branch is selected if a branch has been issued for a strand and the strand 
becomes committed. The branch from the lowest numbered committed strand is 
selected. The branch destination specifies a base strand number. This is used to 
set the initial value for the SPM register at the destination. 

□ Breakpoint: Indicates that a breakpoint branch has been selected. This happens 
if a breakpoint is encountered in the main code. Such a branch is selected if a 
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strand is committed that has a breakpoint set on it. The branch is given the same 
pnorrty as a strand issued for that branch (but usurps any branch issued for the 
break-pointed strand). At the destination the SPM is set to squash all strands 
except the one with the reached breakpoint. This ensures that onlv the 
appropriate strand is executed in the shadow code. 

The state transitions are detailed in the following diagram. When the end of a reqion 
execution is reached the state is reset to Repeat. 9 




SSLX \— J ^New Reg.cn 



The PBS has a shadow register. The current contents of the RBS are copied to the 
shadow register rf an interrupt handler is called. This allows the previous RBS to be 
restored on return from the interrupt. 

7 -4.2 Region Base Address (RBA) 

The RBA register holds the address of the start of the region currently being executed It ' 
is loaded wrth the value of NRA at the start of each new region execution It is used to 
vfoSf 6 NRA ^ 3 re9 '° n ' S t0 158 re P eated due to a dependency 

The RBA has 1wo separate shadow registers. One is for shadowing the register in the 
case of an interrupt. The second is in the case of entry to the monitor. The RBA and the 
two shadow copies are effectively part of the branch resolution unit. " 

When an interrupt occurs the contents of RBA is copied to the interrupt shadow. Upon 
return from toe interrupt the shadow version is selected as the destination address that is 
baded into NRA. When the return branch is actually performed this is copied back into 
RBA. Thus execution is restored into the region that was interrupted. 

When entry to the monitor code occurs the contents of RBA is copied to the monitor 
shadow register Such an entry occurs when executing shadow code and no explicit 
destination branch is selected. At this point the monitor is entered as any single step 
I been Performed. Upon return from the monitor the shadow version is 
setec ted as the destination address that is loaded into NRA. When the return branch is 
actually performed this is copied back into the RBA. Thus execution is restored in the 
current shadow region. 

7.4.3 Execution Priority Level (EPL) 

The EPL register holds the current execution state. This has two individual components. 
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7.4.3.1 Interrupt Flag 

The intenrupt flag is set if the processor is currently within an interrupt handler. If this flaq is 
set then no further interrupts are handled until a return from the current handler has been 
made. This is made a separate flag so that breakpoints can be handled without 
restrictions within an interrupt handler itself. 

7.4.3.2 Execution Status 

This shows the current status of execution. The processor enters different states 
depending upon whether it is executing normal code, shadow code or has entered the 
monitor after reaching a breakpoint. Some functional units may only be used when the 
processor is in certain modes. 



I LEVEL 


VALUE 


DESCRIPTION 


Normal 


0 


Normal execution mode. If a breakpoint is encountered then the 
breakpoint branch is taken and the processor enters shadow mode 


Shadow 


1 


Indicates that the processor is executing shadow code. No further 
interrupts will be processed while in this mode. If a region does not 
execute a branch then the monitor branch is taken (rather than 
restarting the region) and monitor mode is entered 


Monitor 


2 


Indicates that the processor is in monitor mode. No further 
interrupts will be processed. This provides access to certain 
functional units that are not otherwise accessible (such as the 
instruction predicate and debug communication units). Monitor 
mode may also be entered by certain interrupts. 



The diagram below shows the state transitions for the execution level: 
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7.4.4 Next Region Attributes (NRA) 

The NRA register contains the address of the next region that is to be executed The 
branch resolution unit calculates the NRA as each strand is committed. The branch from 
the lowest numbered strand is selected as the destination. The branch does not occur 
until the end of the current region is reached. 
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The NRA may also be loaded with the address of an interrupt handler if an InterruDt 
request is received. Such a request may be received at any point any is responded to 
immediately, even if the current region has not finished executing. 

The NRA register has a shadow copy. This is used to preserve the NRA state rf an 
interrupt occurs. Upon return from the interrupt the shadow copy of NRA is restored into 
the main register. This allows execution of the region to be correctly restored even if the 
destination region had been selected prior to the interrupt occurring. 

The NRA consists of two fields. There is a full destination address that allows the 
specification of an address in the shadow memory in addition to the main memory This is 
required for breakpoint branches. There is also a strand field for specifying the first strand 
that should be executed at the destination. This allows branches to side entries of a 
region. 

When tlie end of the region is reached the NRA register is used to find the correct entry in 
the instmction buffer and to set the initial state of SPM (if an ordinary branch is being 
performed). The address of the region within the instruction buffer is loaded into EWA 
from where execution of the region is commenced. This is illustrated in the diagram 
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7.4.5 Branch Issue 

Branches operations may be issued to the branch unit. Branch operations only load the 
required destination information into the branch registers within the branch control unit 
The actual branch is not performed until the end of the region is reached. Thus a multi- 
way branch is resolved at the end of the region. 

Each strand is able to perform a single branch. A branch within a particular strand is 
unconditional. It is effectively performed at the end of the strand as the branch does not 
occur until the end of the region. 

7A6 Branch Resolution Architecture 

This unit is responsible for selecting a destination address. It does this by selecting one 
particular address from a bank of possible destinations. 

There are three separate banks of destinations: 

□ Strand Branches: There is a separate branch register for each strand. If the 
strand is committed then any branch issued for it is selected as the destination 
(unless usurped by a branch from an earlier strand) 

□ Breakpoint Branch: This branch is performed if a breakpoint is encountered 
when executing the main code. 
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□ Monitor Branch: This branch is performed if no other branch is selected when 
executing shadow code. 

□ RBA Branch: This is the base address of the region currently being executed. 
This destination is used if the current region is to be restarted due to a hazard or 
guard. 

The structure of the branch resolution unit is shown below. 
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When a branch is issued the destination is generally taken from the operand supplied. 
Howeve*r, if the branch method is' a return (monitor or interrupt) then the destination is 
actually selected from one of shadow RBA registers and the supplied destination is 
ignored. For an interrupt return the interrupt shadow RBA is selected that holds the RBA 
at the point the interrupt was handled. For a monitor return the monitor shadow RBA is 
selected to return executing to the point at which the monitor was entered. 

Each strand branch register has a state bit showing whether it has been loaded with a 
branch destination in the current region. If it has, and the strand is selected in the CSM, 
then the branch is considered as a possible destination. If there are multiple possible 
destinations then the one with the lowest strand number is selected. Since there is a 
specific register for each strand this can be determined statically without the need for 
comparison logic. 

The breakpoint branch is enabled by a breakpoint hit flag that is asserted if a strand with a 
breakpoint on it is committed. 

The main RBA register is the default destination if no other is selected. This causes the 
current region to be re-executed. 

The monitor entry address register is only utilised if executing shadow code. The register 
is loaded with the entry point of the monitor program. The branch is taken if no other 
specific branch is selected (shadow code never requires restarts using the RBA register). 
The monitor is then entered to allow the current machine state to be reported or advanced 
as required. 
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When a destination is selected the information is copied to the NRA register This is 
distributed to the instruction control unit. The instruction control unit then performs a 
lookup to determine if the region is currently held in the instruction buffer. If not then the 
region is fetched from main memory when the end of the region is reached. If the region is 
present then its location in the instruction buffer is located so that execution can continue 
immediately after the end of the current region. 

The branch resolution unit selects the appropriate destination and also generates a 
squash vector. This vector is distributed to the strand control unit. All strands higher than 
the strand containing the taken branch are squashed. This is because they are not 
logically reached in the code if an earlier branch is taken. 

All state in the branch control unit has shadow copies. If an interrupt is entered then a 
copy is made in the shadow registers. Upon return from an interrupt the state is restored 
into the main register. This allows execution to be restarted correctly from the point of the 
interrupt with the correct information held by the branch resolution unit. 

7.4.7 Interrupt Resolution Architecture 

This branch unit also handles branches initiated by interrupts. This interrupts are 
asynchronous and may be generated from individual execution units or from sources 
external to the system. When an interrupt is received it causes the processor to enter the 
appropriate handler routine as quickly as possible. 

To achieve short and deterministic interrupt latencies, the response to an interrupt occurs 
dunng the execution of a region before it is completed. Only one interrupt is handled at a 
time. Interrupt priorities or nested interrupts are not handled by the architecture If the 
processor is already handling an interrupt then no response to further interrupts is made 
until the handler is completed. Entry to the handler clears the interrupt state. 

The diagram below shows the structure of the interrupt registers in the branch control unit: 

Interrupt Registers to Support M Interrupts 
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Each interrupt source has a destination address that is loaded during the initialization of 
the program. When an interrupt is received the destination address is loaded into NRA 
and a branch is immediately initiated. This may override any branch that has been 
previously selected within the region currently being executed. Many control registers and 
the branch registers within the branch control unit have shadow copies. A copy is made 
from the main state to the shadow state upon entry to the interrupt handler. This allows 
the state to be stored quickly and to be restored immediately upon return from the 
interrupt. Thus the same destination will be re-selected upon interrupt return. 

Fast and slow interrupts are supported. The only difference between these is that the 
regions for the handler function for a fast interrupt are locked into the instruction buffer 
Thus there is no latency while the regions are copied from main memory. If a fast interrupt 
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occurs while the regions are being loaded for a slow interrupt then the fast interrupt 
usurps the slow interrupt. »«oiiu^i 

7.5 Main Breakpoint Registers (MBR) 

The branch control unit contains a number of Main Breakpoint Registers (MBR) These 
are used to implement breakpoints that may be set using the monitor. These breakpoints 
are initially set via the branch unit. The breakpoint positions are set using the destination 
operand of the branch unit that is transferred into a selected MBR. 

Breakpoint addresses set using the monitor are specified in terms of an address in the 
onginal code. The monitor using a mapping table generated by the code generation 
process to translates this into the equivalent position in the translated code. 

A breakpoint activated on entry to a particular strand within a particular region 
Breakpoints must be set on strand granularity since they represent blocks of conditionally 
!Z th e original code. A breakpoint must not be activated until it is 
established that the particular strand is to be committed. 

Whenever a region succession occurs, the NRA is compared against the set of region 
addresses in the MBR registers. Each of these has. an associated strand number. If there 
Si ? ° e \f b, : eak P 0 i nt has been set h the region being branch to) then the associated 
strand number is used to generate the Breakpoint Disable Mask (BDM). This is used in 
Oia calculation of the SEM and disables all strands for which a breakpoint is active 

T y K S6t "Vt! 6 83,716 re9ion ' in rase ma V be multiple 
hits in the MBR. The strand for each breakpoint is included in the BDM. 

An actual breakpoint event occurs if an attempt is made to commit a strand that has an 
active breakpoint associated with it. This indicates the execution has logically entered the 
* ♦ *** been s q uashed 0-e. it is. not logically executed) or has been 
aborted due to a hazard then the strand is not included in the CSM and thus the 
breakpoint event does not occur. 

If a breakpoint event occurs then a branch is made to the shadow code via a breakpoint 
branch. The mask of the BDM and CSM gives the set of strands causing the breakpoint 
event. The lowest such strand is used as the seed value for SPM for execution of the 
shadow region. This ensures that only code from the breakpoint strand is executed. 

The diagram below shows the structure of the MBR and the breakpoint mechanism. 



48 



CriticalBlue 

Next Region 
Address (NRA) 

a 



Microarchitecture Description 
Registers to Support N Breakpoints 



Compare 



Region Address 



Breakpoint Strand 



Compare 



n 



Region Address 



Breakpoint Strand 



Breakpoint N-l 
Breakpoint N-2 



Compare 



Region Address 



Breakpoint strand Breakpoint 0 



Combine Strands 



Committed Strand 
Mask (CSM)~ 



Breakpoint Disable 
Mask (BDM) 

^Breakpoint Flag and 
Breakpoint Strand 



8 Functional Units 

8.1 Introduction 

This chapter describes each of the functional unit types required in the architecture 
These represent the minimum set of units that are required to build a processor Some 
units associated with debugging are optional. 

8.2 Memory Units 
8.2.1 Description 

A memory unit is used to hold 32 bit word sized data. The size of the memory is 
completely configurable. A CriticalBlue processor has a single main address space in 
which both code and data are stored. The data addresses used in the memory 
correspond exactly to those that would be used on the host processor. By default the 
main memory is directly accessed and is not cached. 

The basic architecture of a memory unit is shown below: 
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Basic methods are supported to read and write the memory. Write operations specify an 
access size so that the appropriate byte enables can be asserted for a subword write. The 
byte shift unit performs the alignment of data for subword reads and writes. This avoids 
the need for data shift and sign extension logic in the memory unit itself. 
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There must only be one main memory unit in the processor. This is because the 
execution semantics assume that there is a single unified data memory space. The 
restriction of a single memory unit can significantly impact the potential parallelism in an 
application as memory is accessed very frequently in most applications. To alleviate this 
bottleneck then the main memory unit of the process may have multiple ports. Each of 
those ports must access the same address space. The code generation will ensure that if 
a write operation is being performed then it will not be aliased with any other read or write 
on the same cycle. This avoids any complexity associated with forwarding aliased 
accesses or resolving the priority of multiple simultaneous writes to the same location. 

The user can define additional explicit memory units. These may be accessed by special 
functions, such as operator overloading of array accesses. These explicit memory units 
may be of user defined narrower width than the 32 bits of main memory. Additional 
memory units do not support subword accesses. This ability to create user defined 
memory units is very powerful and allows additional parallel memories to be used for DSP 
like operations. This is similar to the separate X and Y memories typically found in DSP 
architectures. Again, an explicit memory unit may have multiple ports. 

The operands and result ports detailed below must be repeated for each port to the 
memory unit. Each port must be capable of supporting both read and write operations. 
Obviously the memory implementation becomes slower and less area efficient as 
additional access ports are added. The degree of degradation is obviously dependent 
upon the underlying implementation technology being employed. The design must 
balance these factors against the greater parallelism afforded by providing multiple 
register ports. 



8.2.2 Methods 



METHOD 


DESCRIPTION 


Read 
Word 


Reads a word from the memory. The lower two address bits are. ignored. 


Write 
Byte 


Writes a byte value to the memory. The lower 2 address bits are used to 
determine which byte within the word is to be written. The data must be 
aligned prior to the write. This is not supported for explicit additional 
memory units. 


Write 
Short 


Writes a 16 bit value to the memory. The lower 2 address bits are used to 
determine which bytes within the word are to be written. The data must be 
aligned prior to the write. This is not supported for explicit additional 
memory units. 


Write 
Word 


Writes a word to the memory. The lower two address bits are ignored. i 



8,2.3 Operands 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Address 


32 


The address of the location being accessed. The least 
significant two bits of the address are used for subword writes 
to determine which individual bytes in the word are being 
written. 


Write 
Data 


32 


The data being written for a write method. If a subword write is 
being performed then the byte shift unit must be used prior to 
the write in order to align the data within the word. 
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| NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Read 
Data 


32 


The data read from the memory. If a subword read is being 
performed then the byte shifter unit must be used to properly 
align and sign extend the data. 



8.3 Branch Unit 
8.3.1 Description 

This unit allows branches to be issued that are processed by the branch control unit In 
contrast to existing architectures, the issue point of branches in a CriticalBlue processor is 
decoupled from the actual point of branch initiation. This allows much greater scheduling 
freedom for branches, allowing them to be issued out of order and before the associated 
branch condition has been evaluated. The branch control unit is able to select one of 
multiple branch destinations at the end of a region. The branch itself occurs at the end of 
region execution. 

The strand number in which the branch was issued is used to determine which slot in the 
branch control unit the destination is held. Only one branch may be performed for each 
strand. If the strand is not committed then the branch is not taken. A branch from a lower 
numbered committed strand may usurp the branch. If a conditional branch is beinq 
performed Ihen the branch itself is placed in a separate strand that is squashed as 
appropriate by the condition. 

Branch destinations are specified as an absolute address of a region combined with the 
number of the first strand to be executed at the destination. The address contains a 
memory bank specifier that allows addresses in the shadow memory to be represented 



8.3.2 Methods 



METHOD 


DESCRIPTION 


Branch 


Performs a basic branch operation. This is used to implement all 
branches, calls and normal returns in the architecture I 


Breakpoint Branch 


Used to specify the destination for a breakpoint branch. Such a 
branch is issued in each region of the main code and specifies the 
location of the associated shadow code for the region. The branch 
is only taken if a breakpoint is encountered 


Lock Down 
Branch 


Performs a lock down branch. This is used in the entry code to 
lock the regions for fast interrupt handlers into the instruction 
buffer. This branch is executed in a special mode so that the 
called region is allocated to the lock down area of the instruction 
buffer. 


Interrupt Return 


Performs a return from an interrupt handler. No destination is 
specified as it is obtained from the interrupt shadow version of the 
RBA register. 


Monitor Return 


Performs a return from the monitor. No destination is specified as 
it is obtained from the monitor shadow version of the RBA 
register. 


Set Monitor 
Handler 


Used to set the handler address for the monitor. This branch is 
taken when executing shadow code and no specific destination 
branch has been selected. 
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Set Interrupt 
Handler 


Used to set the handler address for interrupts. These are 
specified at the start of the code execution and are then held 
within registers in the branch control unit. The strand number is 
used to specify the interrupt number whose handler is being set. 


Set Main 
Breakpoint 


Sets a main breakpoint register in the branch control unit. The 
destination region address and strand specify the location at 
which a breakpoint event should be generated. If a breakpoint is 
encountered then the breakpoint branch for the region is taken. 
There are 8 different variants of this method to select the required 
breakpoint register number. This method is only executed if the 
processor is in monitor mode. 


Clear Instruction 
Buffer 


Clears all cached entries from the instruction buffer (for 
processors using instruction buffer caching). Issued by the entry 
code before locking down fast interrupt handlers to ensure that 
any previously cached regions are evicted. 


Disable Interrupts 


Disables all interrupts. Usually mapped to a specific function to 
allow interrupts to be disabled from a high level language. Issued 
at the start of the entry code to prevent interrupts occurring while 
the interrupt handler vectors are being set up. 


Enable Interrupts 


Enables all interrupts. Usually mapped to a specific function to 
allow interrupts to be enabled from a high level language. Issued 
after the interrupt handlers have been set up during the entry 
code. 



8,3,3 Operands 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Destination 


24 


The destination address. This is generally transported from 
an immediate unit but may be from a memory loaded value 
to support indirect branches (including returns). The 
destination includes 4 bits for specifying the base strand. 
This field is used to specify an extended address for 
breakpoint branches to address the shadow memory. The 
strand associated with the operand determines which strand 
the branch is associated with. 



8.3.4 Results 

The branch unit generates no results. 

SA Immediate Unit 
8.4.1 Description 

An immediate unit is used to provide a particular literal value for use by a subsequent 
operation. The literal value is obtained directly from the execution word, via the method 
field for the immediate unit. 

A range of immediate fields for loaded literals of different widths is provided. Shorter literal 
values occur more frequently than longer ones. The code generation algorithms always 
use the shorter possible literal in order to maximize code density. At least one immediate 
unit is provided that can load literal values at least as wide as the number of bits in the 
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address space of the processor. This allows addresses to be loaded usinq a sinale 
immediate operation. a 

All immediate values are sign extended to 32 bits. The internals of the immediate 
execution unit is simply a latch of the method input to delay it by one clock and some sian 
extension logic. " a 

8A2 Methods 

An immediate unit only has a single method to load an immediate value from the 
execution word. The method field is used to hold the immediate value itself so the width is 
dependent upon the size of immediate being loaded. 

Some immediate units may also have an immediate RSN value. This is provided as a 
separate bit field from the execution word. 

8.4.3 Operands 

An immediate unit has no additional operand inputs. Since it has no operand inputs no 
strand number is selected to gate the execution for the unit Immediate operations are 
always executed. This is a requirement since immediate operations are used to set the 
original strand numbers for subsequent operations in a strand. 



8.4.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION ~ ~ 


Immediate 
Value 


32 


The signed extended immediate value extended to 32 



8.5 Addition Unit 
8.5.1 Description 

The addition unit supports basic 32 bit addition between two operands. It is used in 
circumstances where no condition code results are required. In particular the unit is used 
for address calculation operations. Providing the basic addition functionality in a separate 
unit allows multiple instantiations of it to be provided without the overhead of a more 
general purpose unit. 



8.5.2 Methods 

This unit supports only a single method. 



METHOD 


DESCRIPTION 


Add 


Performs a basjc 32 bit addition between the 
operands 



8.5,3 Operands 



I NAME 


WIDTH (BITS) 


DESCRIPTION 


Left 


32 


Left operand for addition 

> 


Right 


32 


Right operand for addition 
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8.5 A Results 



NAME 


WIDTH (BITS) 


DESCRIPTION 


Result 


32 


Result from addition calculation 



8.6 Arithmetic Unit 
8.6.1 Description 

The arithmetic unit is used for basic addition and subtraction calculations. It implements 
the add and subtract instructions of the ARM architecture. Note that the reverse subtract 
is implemented by simply reversing the order of operands to the subtract unit. Methods 
are supported that perform addition and subtraction with the carry flag. 

Condition flag bits are produced and may be optionally read after a calculation. Simple 
addition operations for which no result flags are needed may be implemented using the 
addition unit. 



The arithmetic unit is used for performing compare operations. These are followed by 
squash operations that are dependent on one or more bits from condition codes. If a 
compare is being performed then the actual result value is not used. 



8.6.2 Methods 



METHOD 


DESCRIPTION 


Add 


Performs a 32 bit addition between the operands 


Add with Carry 


Performs a 32 bit addition between the two operands and the carry 
operand 


Subtract 


Performs a 32 bit subtraction between the operands 


Subtract with 
Carry 


Performs a 32 bit subtraction between the operands the carry 
operand 



8.6.3 Operands 



|| NAME 


WIDTH (BITS) 


DESCRIPTION 


Left 


32 


Left operand for calculations \ 


Right 


32 


Right operand for calculations 


Carry 
In 


1 


Input carry bit used for add/subtract with carry methods 



8.6.4 Results 



NAME 



WIDTH 
(BITS) 



DESCRIPTION 
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Result 


32 


Result from arithmetic calculation 


Carry Out 


1 


Set if a carry was generated as result of the calculation 


Zero Out 


1 


Set if the result of the calculation was zero 


Negative 
Out 


1 


Set if the result of the calculation is negative (same as bit 31 
of the result) 


Overflow 
Out 


1 


Set if the calculation generated an overflow into the sign bit 



8.7 Squash Unit 

8.7.1 Description 

The squash unit is used to update the Strand Predicate Mask (SPM) to indicate which 
strands are to be executed. Upon initial entry to a region the SPM has all bits set to 
indicate that all strands are to be executed. Squash operations are then issued to clear 
bits in the SPM to squash particular strands depending upon the control flow structure 
and conditionals within the region. 

The input operand condition for the squash unit is generally derived from a precedinq 
subtract operation in the arithmetic unit. 

8.7.2 Method Fields 

The method parameter is broken into a number of individual fields to indicate which set of 
strands is associated with the squash. Another bit is also used to invert the condition. 



METHOD FIELD 


FIELD 
WIDTH 


DESCRIPTION 


Invert 


1 


If set then inverts the condition bit 


Base Strand 


4 


The first strand to be squashed 


Number of 
Strands 


2 


The total number of contiguous strands to be squashed 
starting from the base strand. A value of 0 indicates that 4 
strands should be squashed. If a larger number of strands 
need to be squashed then additional squash operations 
are generated. 



8,7,3 Operands 



NAME 


WIDTH (BITS) 


DESCRIPTION 


Squash 
Bit 


1 


The condition squash bit 



8.7 A Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Squash 
Vector 


16 


Bit vector showing which strands are to be squashed 
(assumes 16 strands). This output is actually a dedicated 
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port (hw_squash_x) 
strand control unit. 



that is connected directly to the 



8.8 Logical Unit 
8.8.1 Description 

The logical unit is used for implementing the logical operations of the host instruction set. 
Basic AND, OR operations are supported. Condition code bits are produced that may be 
read if required. 



8.8.2 Methods 



METHOD 


DESCRIPTION 


And 


Performs a bit wise 32 bit AND between the operands 


inclusive-Or 


Performs a bit wise 32 bit inclusive OR between operands 


Exclusive-Or 


Performs a bit wise 32 bit exclusive OR between operands (an 
exciusive-Or with all 1's used to form a bit wise NOT) 



8.8.3 Operands 



I NAME 


WIDTH (BITS) 


DESCRIPTION 


Left 


32 


Left operand for logical operation 


Right 


32 ; 


Right operand for logical operation 



8.8.4 Results 



j NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Result 


32 


Result from the logical operation 


Zero Out 


1 


Set if the result of the operation was zero 


Negative 
Out 


1 


Set if the result of the operation is negative (same as bit 31 
of the result) 



8.9 Byte Shift Unit 
8.9.1 Description 

The byte shift unit is able to perform byte alignment operations within a word. It is used in 
conjunction with the bit shifter unit to perform arbitrary shifts of a word. The combination of 
the byte and bit shifter avoids the need for a full barrel shifter implementation. 

Methods are supported for performing rotates as well as arithmetic and logical shift 
operations. 
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The byte shift unit is also used for aligning data prior to a subword write and after a 
subword read. A byte align method moves subword data into the correct position in a 32 
bit word prior to a write. This avoids the need for such a shifter unit integrated into the 
memory unit. Byte and short alignment methods in both unsigned and signed varieties are 
supported. This shift a loaded 32 bit word so that the right data is present in the least 
significant bytes of the word. This operations are able to operate from a supplied byte 
address directly. 



8.9.2 Methods 



METHOD 


DESCRIPTION 


Rotate* Rinht 
La lc? r\iyi (i 


Perform q s mtat^ rinht nnorotinr* R\/too ohiftoH "frv^m thes Unot 

roiiuniio ci tuiaic iiyiu vjpt;i diiuii. Dyies onureu iTorn me least 
significant byte are moved to the most significant byte of the word. 
Bits 4-5 of the shift amount determine the size of the shift. If these 
bits are 0 then no shift is performed. 




Porfnrrnc: a mtafr* lp>ffr nnoratinn R\/foo chifFoH from ih^ mncf 
rtSIIUIIIlo a lULaLt; loll UfJol dUUI 1. Dylco bnilLcU HOiTI UlS illOSI 

significant byte are moved to the least significant byte of the word. 
Bits 4-5 of the shift amount determine the size of the shift. If these 
bits are 0 then no shift is performed. 


Arithmetic Shift 

Rinht 


Performs an arithmetic shift right operation. The most significant 
un io uupnuditju uiruuynuui any 01 ine mosi signrncani oyies tnai 
are shifted. Bits 4-5 of the shift amount determine the size of the 
shift. If these bits are 0 then no shift isj)erformed. 


Logical Shift Right 


Performs a logical shift right operation. The most significant bytes 
that, are shifted are filled with 0 bits. Bits 4-5 of the shift amount 
determine the size of the shift. If these bits are 0 then no shift is 
nprformpd 


Logical Shift Left 


Performs a logical shift left operation. The least significant bytes 
that are shifted are filled with 0 bits. Bits 4-5 of the shift amount 
determine the size of the shift. If these bits are 0 then no shift is 
performed. 


Byte Align 


Performs a byte align operation. This is a rotate left operation with 
the number of bytes to be shifted determined by bits 0-1 of the 
oimi diiiuum. 1 1 iio upuictuun is perrormsa on in© uaia prior to a 
subword write in order to move the data into the correct position 
within the word. The address is supplied as the shift amount. 


Unsigned Byte 


Performs an unsigned byte extension. This is similar to a logical 
shift right operation. The active byte from a loaded 32 bit data 
item is shifted to least significant byte of the result. All higher bytes 
in the word are cleared to 0. This is used after a subword load 
operation. The shift amount is determined from bits 0-1 of the 
operand, which is supplied with the loaded address. 


Signed Byte 


Performs a signed byte sign extension. This is similar to an 
arithmetic shift right operation. The active byte from a loaded 32 
bit data item is shifted to least significant byte of the result. All 
higher bytes are duplicated with the most significant bit of the 
byte. This is used after a subword load operation. The shift 
amount is determined from bits 0-1 of the operand, which is 
supplied with the loaded address. 


Unsigned Short 


Performs an unsigned short (16 bit word) extension. This is 
similar to a logical shift right operation. The active short from a 
loaded 32 bit data item is shifted to least significant 16 bits of the 
result. All higher bytes in the word are cleared to 0. This is used 
after a subword load operation. The shift amount is determined 
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from bits 0-1 of the operand, which is supplied with the loaded 
address. 


Signed Short 


Performs a signed short (16 bit word) sign extension. This is 
similar to an arithmetic shift right operation. The active short from 
a loaded 32 bit data item is shifted to least significant 16 bits of 
the result. All higher bytes are duplicated with the most significant 
bit of the byte. This is used after a subword load operation. The 
shift amount is determined from bits 0-1 of the operand, which is 
supplied with the loaded address. 



8.9,3 Operands 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Value 


32 


The input data value for the shift. 


Shift 
Amount 


5 


The shift amount. This may be read from bits 4-5 or 0-1 
depending upon the method invoked. 



8.9.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Result 


32 


Result from the byte shift operation 


Carry 
Out v 


1 


The last bit that has been shifted out of the word. For a right 
shift this is the most significant bit of the most, significant byte 
that is shifted out of the word. For a left shift this is the least 
significant bit of the most significant byte that is shifted out of 
the word. 



8.10 Bit Shift Unit 
8.10.1 Description 

Performs a bit shift operation. This unit is able to perform shifts of up to 7 bits. The byte 
shifter must also be used to perform longer shifts. A combination of a byte and bit shifter is 
used to avoid the need for a full barrel shifter implementation. 

Methods are supported for rotating left and right as well as arithmetic and logical shifts. A 
rotate right extended allows a carry input bit to be shifted into the word. 



8.10.2 Methods 



I METHOD 


DESCRIPTION 


Rotate Right 


Performs a rotate right operation. The bits, which are shifted from 
the least significant end of the word, are rotated into the most 
significant bits of the result. 


Rotate Left 


Performs a rotate left operation. The bits, which are shifted from 
the most significant end of the word, are rotated into the least 
significant bits of the result. 


Arithmetic Shift 
|l Right 


Performs an arithmetic shift right operation. The most significant 
bit of the word is duplicated into the most significant bits of the I 
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result. 


Logical Shift Right 


Performs a logical shift right operation. The most significant bits of 
the word are cleared to 0 are bits are shifted right. 


Logical Shift Left 

V 


Performs a logical shift left operation. The least significant bits of 
the word are cleared to 0 are bits are shifted left. 


Rotate Right 
Extended 


Performs an extended right shift. Only a shift of one bit is 
performed independently of the shift amount. The carry in bit is 
copied into the most significant bit of the word. 



8,10,3 Operands 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Value 


32 


The input data value for the shift. 


Carry In 


1 


The input carry bit for shifts. This is used for a rotate right 
extended. If the shift amount is 0 then this value is 
duplicated to the carry out result (so that the carry result 
from a previous byte shift is passed through) 


Shift 
Amount 


3 


The shift amount. 



8.10.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Result 


32 


Result from the bit shift operation 


Carry 
Out 


1 


The last bit that has been shifted out of the word. If no shift was 
performed because the shift amount was 0 then this is given 
the same value as the carry in operand. 



8.1 1 Multiplication Unit 
8.11.1 Description 

Performs a 32 bit by 32 bit' multiply operation. A 64 bit result is produced as two 
independent 32 bit results. In some cases on the lower 32 bits of the result are used. 
Signed and unsigned multiplication methods are supported. 



8.11.2 Methods 



METHOD 


DESCRIPTION 


Signed Multiply 


Performs a signed multiply operation 


Unsigned 
Multiply 


Performs an unsigned multiply operation 
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NAME 


WIDTH (BITS) 


DESCRIPTION 


Left Operand 


32 


The left operand for the multiply 


Right 
Operand 


32 


The right operand for the multiply 



8.1 1.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Lower Result 


32 


Bits 0 - 31 of the multiply result 


Upper Result 


32 


Bits 32 - 63 of the multiply result 


Zero Out 


1 


Set if the result of the multiply is 0 


Negative 32 bit 
Out 


1 


Set to the same value as bit 31 of the result (the sign bit 
for 32 bit results) 


Negative 64 bit 
Out 


1 


Set to the same value as bit 63 of the result (the sign bit 
for 64 bit results) 



8.12 Register File 
8.12.1 Description 

Register file units are used to hold the current contents of 32 bit registers in the processor. 
The register file is a standard functional unit that is accessed explicitly by the program 
code to read and write a register value. 

Register read operations specify an immediate RSN value. This allows a register value to 
be read for use by a particular strand and the strand dependency to be flowed into 
dependent operations in the strand. 

A method is supported to provide a combined read and write operation to a single 
register. This is useful when a value written in one strand is to be immediately used by 
operations in another strand. Only a single access port is needed to the register file itself 
as the written value is simply bypassed to the register read output. If the writing strand is 
actually disabled then the combined operation is simply interpreted as a read. 

In general there is a significant requirement for register reads and writes. A single ported 
register file can significantly restrict the amount of the parallelism that can be extracted 
from an application. In general the register file will support a number of independent 
access ports. However, the number of ports will be less than that normally required to 
support the same level of parallelism in a traditional Superscalar or VLIW implementation. 
Each of the access ports will replicate the method, operand and result ports as shown 
below. Note that the code scheduling will ensure that a particular register is only accessed 
by a single port on any given clock cycle to simplify the implementation of the multiple 
access port schemes. 
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BA2J2 Method Fields 



METHOD FIELD 


FIELD 
WIDTH 


DESCRIPTION 


Read Flag 


1 


If set then indicated that a r»niQtor r»ari c*k^m im 
performed rather than a write 


Merge Flag 


1 


If set then indicates that both a write and a read are being 
performed to the same register. If the writing strand is 
enabled then the data being written is simply passed 
through as the result (although can be labeled with a 
different strand number). If the writing strand is disabled 
then a standard register read is performed. 


Register 
Number 


5 


The number of the register that is to be accessed. 



8.12.3 Operands 



NAME 


WIDTH 
(BITS) 


DESCRIPTION ~ 


Register Write 
Data 


32 


The data that is to be written into the register if a write is 
being performed. 



8.12.4 Results 



I NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Register Read 
I Data 


32 


The data read from the register if a read was performed 



8.13 Commit Unit 
8.13.1 Description 

The commit unit is used to specify that a particular strand should enter its commit phase. 
Once a strand is in its commit phase it can issue operations that permanently change the 
machine state (such as stores). A strand enters its committed phase on the following 
clock cycle. 

If the strand being committed has issued a branch instruction then there is latency before 
the following commit can be issued. This is because the branch control unit must 
determine if the branch is going to be taken. If so then higher numbered strands' are 
squashed. A number of clock cycles are required for this resolution to take place and the 
updated strand enable mask distributed so that the subsequent strand does not enter its 
committed phase if it has been squashed by an earlier branch. 

In general strands are committed in ascending order. That is a strand cannot be 
committed until. all lower numbered strands have been committed. This ensures that the 
correct branch resolution and store order is maintained. However, the code generator 
may choose to issue some commit operations out of order. In that case the strand commit 
will be preceded by a guard operation. This will abort then strand if it is not the first being 
executed in the region. This mechanism allows improved code density via the reordering 
of strands within the region. 
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The processor is able to support multiple commit units. This allows improved scalability in 
the architecture so that many strands can be efficiently processed in a single region. 

8.13.2 Methods 

The commit unit only has a single method to initiate a commit. The method field is used to 
specify the number of the strand that is being committed. 

8.13.3 Operands 

The commit unit has no operands. 



8.13.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Commit 
Strand 


4 


Shows which strands should be committed. This is simpiy 
the strand specified in the method field. This result is 
supplied one clock cycle earlier than other results to 
reduce the commit-committed latency, so that The reading 
strand is aborted if the read and write. Note that this is a 
dedicated output that is connected to a hw commit x 
port. | 



8.14 Instruction Predicate Unit 
8.14.1 Description 

The main breakpoints in the branch control unit allow a breakpoint to occur on a specific 
region and strand in the main code. The region executes a breakpoint branch to enter the 
shadow code. The instruction predicate unit allows the implementation of breakpoints in 
shadow code. These may be set at the granularity of instructions in the original code. This 
is done by setting breakpoints using the original instruction addresses in the instruction 
predicate unit. Each individual original instruction is guarded by an instruction predicate 
operation. Execution continues until the original instruction on which the breakpoint was 
set is reached. 

There is support for up to 8 breakpoints to be set in the processor. The monitor code 
manages those breakpoints and communicates with the instruction predicate unit. Indeed, 
all methods associated with the predicate unit are only performed if executed from the 
monitor execution level. 

Once a breakpoint is reached single stepping is achieved by advancing the breakpoint 
address to the next instruction. Execution is then restarted in the shadow code in and the 
instruction predicate only enables execution for the duration of the targeted instruction. 
The instruction execution will update the value of the PC register, allowing the breakpoint 
to be advanced to the next instruction. When normal execution needs to be restarted the 
instruction predicate unit is put into continue mode. Execution then continues with the 
targeted arid all subsequent instructions. 



8.14.2 Methods 



METHOD 


DESCRIPTION 


Test 


Method executed at the start of each original instruction in the 
shadow code. The original instruction address is supplied as an 
operand and a check is made to determine if a breakpoint has 
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been reached. The instruction predicate output is then used to 
gate the SEM to enable or disable execution as appropriate. 


Set Shadow 
Breakpoint 


Sets a shadow breakpoint. Eight different variants of this method 
are supported for each of the supported breakpoint registers. 
Writina an unreachable adrirp«;«? Hparc; hr^aknointQ 


Single Step 


Forces the instruction predicate unit to enable execution for one 
instruction after a breakpoint hit. This is used to allow single 
stepping of original instructions. The itionitor code updates the 
breakpoint after execution in order to single step through the 
code. The updated PC value is used to generate the new 
breakpoint. 


Continue 


Allows an execution continuation after a breakpoint has been 
reached. Causes the instruction predicate to enable execution 
after a breakpoint hit for all subsequent instructions. 



8.14.3 Operands 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Address 


32 


During the setting of a breakpoint this gives the original 
address of where the breakpoint is being set. This is held in an 
internal register within the instruction predicate unit. During - 
execution of shadow code this operand is used to specify the 
current original execution address. It is compared against the 
set breakpoints. 



8.14.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Breakpoint Hit 


1 


Set if there was a breakpoint hit since the last update to 
the unit 


Breakpoint 
Number 


3 


Gives the number of the breakpoint encountered if there 
was a hit 



8.15 Debug Communication Unit 
8.15.1 Description 

The unit provides bi-directional serial communications with a host system. This is used to 
send and receive monitor related commands and information for debugging. Methods 
may only be performed on the unit when executing at monitor level. 



8.15.2 Methods 



METHOD 


DESCRIPTION 


Send 
Byte 


Method to send a byte of data to the host. 
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NAME 


WIDTH (BITS) 


DESCRIPTION 


Send 
Data 


8 


The byte of data to be sent to the host 



8.15.4 Results 



NAME 


WIDTH (BITS) 


DESCRIPTION 


Data 
Available 


1 


Set if data has been received from the host 


Received 
Data 


8 


The byte of data received from the host 



8.16 Shadow Memory Unit 

8.16.1 Description 

This unit controls the interface to shadow memory. This memory is required when 
performing debugging and contains the versions of the code that allow single step debug. 
The shadow memory also contains code and potentially the monitor control code itself. 

The shadow memory unit accesses the external shadow memory itself using a low pin 
count serial interface. Thus accesses to the memory are slow. The wait signal is used to 
extend accesses to the memory as required. 

The instruction control unit may utilise this memory unit directly in order to load code 
regions into the instruction buffer that are held in shadow memory. 

8.16.2 Methods 



METHOD 


DESCRIPTION 


Read 


A 32 bit read from shadow memory. This operation is ignored unless 
performed by code running at the monitor level. 


Write 


A 32 bit write to shadow memory. This operation is ignored unless 
performed by code running at the monitor level. 



8.16.3 Operands 



I NAME 


WIDTH (BITS) 


DESCRIPTION 


Address 


32 


The address in shadow memory that is to be accessed 


Write 
I Data 


32 


The data for a write to shadow memory 



8.16.4 Results 



NAME WIDTH (BITS) 



DESCRIPTION 
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Read 
Data 



32 



Data that has been read from the shadow memory 



8.17 Power Control Unit 
8.17.1 Description 

This unit initiates a power down operation. Software may implement a power down if 
there are no further operations to be performed. A power down is implemented by 
causing a pipeline stall until an interrupt is received. An interrupt wakes the processor up 
again and execution continues from the point of the power down. For the power down to 
be efficient, execution units must utilise the clock gating signal from the pipeline control 
glue unit to either mask the clock or reduce power requirements in some other manner. 



8.17.2 Methods 



METHOD 



DESCRIPTION 



Power 
Down 



Powers down the processor by forcing a pipeline stall until an interrupt is 
received. 



8.17.3 Operands 

The unit has no operands. 

8.17.4 Results 

The unit generates no results. 

8.18 Check Hazard Unit 

8.18.1 Description 

The check hazard is a simple address comparison unit that is provided to enable 
speculative loads to be issued. The unit is supplied with a read and a write address. If the 
two addresses match then the strand supplying the read address is aborted. 

The code generator automatically schedules check hazard operations for all loads that 
have been boosted earlier than temporally preceding store operations to which they may 
be aliased. If the reading strand is aborted then and is not subsequently squashed then it 
is re-executed in order to obtain the correct value from the store. Check hazard operations 
may only be issued in the speculative phase of the reading strand. 

Multiple check hazard units are generally supplied to allow loads to be boosted across 
several store instructions if required. 

8.18.2 Methods 

The unit only has a single method to check for hazards. 



8.18.3 Operands 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Write 
Address 


32 


The address of a write from an earlier strand. The strand 
number for the operation is obtained from this operand. If 
the earlier strand containing the write is disabled then the 
operation is disabled. | 
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The data for the boosted read from a later strand. The read 
has been issued earlier than the temporally earlier write 
This operand is used to select the corresponding strand 
number of the reading strand. The reading strand is 
aborted if the addresses matoh. 



8.18.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION 


Abort 
Strand 


4 


Shows which strans should be aborted. The reading strand 
is aborted if the read and write addresses match. This 
forces the reading strand to be re-executed if it has not been 
squashed. Note that this is a dedicated output that is 
connected to a hw abort x port 



8.19 Guard Unit 

8.19.1 Description 

I!?L 9Ua ? ^ t0 P revent execution of a strand unless it is the lowest 

numbered strand being executed in a region. Use of the guard operation allows certain 

jassasw 56 performed * at wou,d not ~ e be possib,e s 

The usual dependencies between the completion of strands can be broken. Any chain of 
£ rea * the ^pendency are preceded by a guard instruction. If thesfrSS t 

25F2T d "2" Pre ?T haVe n0t been then the strand is aborted. The 

abort causes the region to be re-executed (unless the strand is squashed) On the 

bfaSed. " Strand ^ bS ** l0W6St b6ing executed 80 *» 9 ua ' d «■ not 

8.19.2 Methods 

iTn?.^ un '"t effe t ctiv ely only has a single method. The method field is used to specify 
the number of the strand that is being guarded. ****ny 

8.19.3 Operands 

The guard unit has no operands. 



8.19.4 Results 



NAME 


WIDTH 
(BITS) 


DESCRIPTION I 


Abort 
Strand 


4 


Shows which strand should be aborted. The guarded strand 
is aborted if it is not the lowest numbered strand enabled in 
the SPM. This forces the region to be re-executed in order to 
complete the strand. Note that this is a dedicated output that 
is connected to a hw abort x port. 



8.20 Explicit Functional Units 

The user defined functional units form the bulk of any CriticalBlue processor. All the fixed 
control and functional units (with the exception of the code memory) only amount to a few 
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thousand gates in size. The code memory dimensions are under the control of the user. 
This allows a very highly scalable architecture in terms of area, parallelism and overall 
performance. 



9-1 Overview 

The CriticalBlue tools provide a powerful simulation environment They enable a complete 
system to be simulated to provide accurate results of how a real system would perform. 
The system is able to provide: 

□ Bit Accurate Results: The patterns of data left in memory or performed in 
simulated I/O are exactly the same as they would be in a real system. 

□ Cycle Accurate Results: The number of execution cycles required to perform a 
programmed set of system operations is exactly the same as it would be in the 
real system. 

Moreover, the system is able to do this using only behavioural models of the hardware 
units. The behavioural code only has to be capable of generated the same final results as 
the real hardware would. It doesn't have to exactly replicate all the internal states of the 
hardware.. This allows trade off decisions between hardware/software to be made without 
having to actually design the proposed hardware. 

All the simulation is performed in a C/C++ environment. The simulation runs natively on 
the machine hosting the CriticalBlue software tools. Since the simulation is in C++ it is 
open to the user to debug using standard development tools. The user is also able to 
instrument the code as they choose to capture information or statistics about the 
execution. Utilities can be used to generate dumps of parameter information to be shown 
later in waveform viewers. 

The standard method for simulating a particular processor is to use an Instruction Set 
Simulator (ISS). Given that the instruction set of a CriticalBlue processor is variable this 
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would require a parameterised ISS. CriticalBlue does not use this approach but instead 
allows the generation of C/C++ code directly from the CriticalBlue tools. This code is 
exactly equivalent to the machine code that would normally be produced. Thus it is cycle 
accurate. Instead of individual instructions, function calls to the behavioural models of the 
execution units are made. The code obeys the constraints of the pipeline and models the 
latencies of individual hardware units. 

The approach has the secondary advantage that the user is able to view the code that 
has been generated for a particular block of source without the need for a re-targettable 
disassembler. The code is shown in the form of the function calls to particular hardware 
units. This avoids needs for hard to read and remember mnemonic forms. 

The simulation environment is quite similar to that provided- for SystemC. However, the 
CriticalBlue environment only requires a single execution thread. The SystemC simulation 
kernel uses a co-routine approach to model multiple parallel execution units. The two 
environments have been are designed to co-exist if necessary. A CriticalBlue hardware 
unit can be modelled in SystemC using all the available constructs. Conversely, a 
CriticalBlue processor running its target software can be modelled as a single SystemC 
module in a larger simulation. 

The simulation environment allows multiple CriticalBlue processors (with different 
instruction sets running different code) to be simultaneously modelled. If SystemC is being 
used then a standard processor core can also be simultaneously simulated. Thus the 
CriticalBlue simulation provides a powerful and open environment that allows an entire 
system to be fully modelled in software. 

BJ2 Implementation Principles 

9.2.1 Behavioural Modelling 

The simulation is performed using the behavioural models of the control and functional 
units in the system. Behavioural models are supplied for all such library units in the 
system. If the user adds more execution units to be used in the processor then suitable 
behavioural models must be supplied. 

All behavioural models are untimed. They therefore execute synchronously and 
immediately return results. The parameters supplied to and the results returned from the 
behavioural model correspond exactly to the operands and results from the hardware 
equivalent. For user extension units the behavioural model has exactly the same structure 
as the software function called in the original code that has been mapped to a hardware 
function. 

The behavioural models for functional units are called as required by code in the program. 
If the program does not use a particular unit then its behavioural model is not called. All 
the control unit models are called on every cycle. They are called within the clock update 
function that is called on each clock. Any other function that starts with the prefix 
hw_clock_ is also called from within this function. Thus the models for particular 
functional units can have behavioural modelled on a clock by clock basis by declaring 
such a function. 

9.2.2 Bus Representation 

All behavioural models are functional. That is they read a number of parameters on each 
clock cycle and return a number of results. Results may be returned s the main return 
from the function and/or via reference parameters. The models should not directly access 
or share global variables directly with models of other units. The parameters and results 
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effectively represent the names of the buses used to interconnect all the units. The input 
parameters correspond to the state of the buses at the start of the cycle. The results 
returned determine the new state of the named buses at the start of the following cycle. 

The buses are modeled as global variables. The bindings between the global variables 
and the parameters are automatically handled by the simulation code generation. The 
behavioural models that need to be executed on each cycle are called in the order they 
are declared in the hardware configuration file. This allows a known call order if there are 
any interactions between them. If a particular model updates the state of a bus then the 
updated value should not be made available to other units on the same cycle. To model 
this behaviour, temporary variables are created to hold the updated values of buses until 
all the models have been called. The updated state is then copied into the requisite global 
variables. 

9.2.3 Glue Unit Modelling 

The glue units in the processor are not directly modeled. Instead their effect is emulated 
by wrapper code around the usage of the behavioural models for the functional units. This 
significantly speeds up the simulation by reducing the amount of indirection and also 
makes the simulation code considerably more readable. How the individual glue units ate 
modeled is discussed below: 

9.2.3.1 Control Unit 

The control unit compares the execution word opcode against that required for a 
functional unit to be activated. This does not have to be modeled as a behavioural model 
for a functional unit method is only called if it is selected. This is determined fully statically. 
The other purpose of the control unit is to read the strand enable bit (from SEM) for the 
strand in which the operation is being performed and use that to gate execution. In the 
simulation this is performed by guarding the use of the behavioural method with an if 
statement. The if is predicated on the simulated value of the SEM. To make the code 
more readable a macro is actually used. Thus each simulated method call is enclosed in 
a macro to guard its execution such as SO (method_call (method params)) to 
indicate that the method should only be called if strand 0 is active. Some methods (such 
as immediate loads and register loads) are not guarded by a strand number so do not 
have to be enclosed in a guard macro. The final function of the control unit is to mask off 
methods if the pipeline is stalled. This is not required in the simulation as pipeline stalling 
is modeled by looping within the clock update function between code for a particular clock 
cycle. 

9.2.3.2 Operand Selector 

The operand selector is used to direct data from an output register of another functional 
unit to the operand of a unit being initiated. The effect of the operand selection is 
performed by the selection of the parameters that are passed to the behavioural model. 
Only parameters that are available via the selector for the particular operand may be 
used. In some cases the operand width will be different from the output register width. In 
that case the parameter usage includes appropriate code to sign extend or truncate the 
operand. 

9.2.3.3 Delay Pipeline 

The delay pipeline has two purposes. Firstly, it delays the valid method flag so that it can 
be used to enable the latching of output data from the execution unit after a delay of the 
unit's latency. Secondly, it delays the input operand values so that they may be written to 
the output directly if a copy operation is being performed. This is not required in the 
simulation since all behavioural modeling is untimed and results are written immediately. 
Thus the delay pipeline does not have to be modeled in the simulation. 



69 



CriticalBlue 



Microarchitecture Description 



9.2.3.4 Result Selector 

The result selector is used to select between the output of the execution unit or one of the 
input operands. The result selector is used to implement copy operations by the functional 
unit. The simulation implements the effect of the copy by simply assigning the input 
operand value to the output register value without calling the behavioural model for the 
execution unit. 

9.2.3.5 Run-Out 

For functional units with multiple cycle latencies a run out queue is required. This is used 
to hold results from the functional unit if there is a pipeline stall. When the pipeline is 
restarted, results are then issued from the queue for a number of cycles rather than 
directly from the output of the functional unit. This does not have to be modeled in the 
simulation since pipeline stalls are implemented by holding the execution completely for 
the required number of cycles. 

Each output bank register also contains a shadow for holding the current output state if 
the processor enters an interrupt handler. This allows the exact state to be recreated 
upon return. This is modeled in the simulation by providing a shadow form of each global 
variable used to hold the output register state from functional units. An interrupt entry 
function is automatically generated that copies the states into the shadow variables. An 
exit function is also generated that restores the copies from the shadow state into the 
main state. 

9.2.3.6 Output Bank 

The output bank is used to hold a set of results from the functional unit. Each of the 
individual output registers is modeled as a global variable that may be accessed as 
parameters for subsequent calls to the behavioural models of fractional units. For units 
with a latency of one a single output variable is required per register. It is assigned 
whenever a call is made to the behavioural model for the unit. For units with a latency of 
greater than one an array of values are required for each output bank. Each of these 
represents a particular future time value (modulo the latency of the unit). They are 
assigned accordingly. This is discussed in the section on pipeline modelling below. The 
variables are only assigned when an operation is performed on the unit so implicitly 
maintains the same value during other clock cycles. 

9.2.4 Pipeline Modelling 

the behavioural models for the control and functional units are untimed. The simulation is 
run so that it is the execute cycle that is being simulated. For functional units that have a 
latency of one clock cycle no further modeling is required. The results generated by the 
function call are written to global variables that are made available during the following 
clock cycle. 

For functional units with multiple cycle latencies further modeling is required. This is 
because in the hardware implementation such units have internal state in their pipelines 
and results are delayed for a number of cycles. This is modeled by using an array rather 
than a scalar value to represent their output results. The size of the array corresponds to 
the latency of the unit. The array effectively represents a look ahead of results that will be 
generated in the future by the unit. Thus if the unit has a latency of N clock cycles then 
results are placed in the Nth entry ahead of the current position in the array. To simplify 
the generated code, static modulo indexing is used in the arrays. The current position in 
the array gradually cycles around the array as the current clock cycle advances. The look 
ahead positions cycle around in the same manner. All reads of the array only access the 
currently available results from the unit. Thus the principle of results being utilised while 
new results are being calculated in the pipeline can be modeled, while using untimed 
behavioural models. If there are unutilized cycles in the pipeline then these do not 
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consume new elements in the array, thus allowing the holding of result values until 
overwritten by subsequent operations to be fully modeled. 

As discussed, all modeling activity occurs at the point of the execute cycle in the pipeline. 
In some cases computation occurring during the decode cycle needs to simulated. For 
instance, the model of the control glue unit needs to use the Strand Enable Mask (SEM) 
during the decode cycle. To allow this the previous state of the SEM global variable is 
copied to a history version on each clock cycle. This corresponds to its value on the 
previous cycle. It is that value that is utilised when reading the mask to determine if an 
operation is enabled or disabled. 

The commit unit is an exception in that it performs its computation during the decode 
cycle. All the operands it requires are available during that cycle. Making the results it 
produces immediately available to the strand control unit simulates this. The result is not 
delayed until the next cycle like other results. 

9.2.5 Region Modelling 

A separate function is generated for each region in the code. Each region has a function 
name that corresponds to its start address in the original code. If the region is located at 
the start of a function entry point in the original code then the function name also includes 
the name of the original function. 

If a cached instruction buffer is being used then the function starts with a check to see if 
the region is present in the instruction buffer model. If not then the code loops while a 
simulation is performed of the loading of the region into the instruction buffer. 

The region itself is modeled as a sequence of operations performed on each clock cycle. 
The code for the region is enclosed within a while loop controlled by a flag indicating if a 
region restart should be performed. Thus region restarts are modeled by looping through 
the code again. The division between clock cycles is clearly delineated. 

The code for each clock cycle calls the behavioural models for operations executed 
during the cycle. A call is made to the clock update function at the end of each clock 
cycle. This in turn calls the functions for all models that need to be updated on each clock 
cycle. This includes the model for each control unit and any timed functional models. 

9.2.6 Region Succession Modelling 

The. main loop of the simulation is used to direct execution to the next region. Upon return 
from a function that implements the execution for a particular region, the behavioural 
model for the branch control unit has an updated NRA. This is the address for the next 
region to be executed. If the same region is to be repeated then this is performed via a 
loop with the region function itself. For other successions (which may include returns and 
indirect calls) a hash table is used to find the mapping between the region address and 
the required simulation function. This table is set up as part of the initialisation of the 
simulation environment. The simulation halts with an error if the region is not present in 
the table. 

9.2.7 Pipeline Stall Modelling 

Pipeline stalls are simply modeled by looping with the clock update function. A global 
variable is used to show if there is to be a pipeline stall. Normally the variable is set to 0. If 
a behavioural model function wishes to cause a pipeline stall then it just needs to assign 
the required number of stall cycles to the variable. The next time the clock update function 
is called it loops for the required number of cycles. This can be used to model the 
behaviour of a cache, for instance. If there is a cache miss then the behavioural model for 
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the memory sets the stall counter with the number of cycles required to retrieve the 
required value. At the next clock update the pipeline is stalled for that many cycles. 

9-2-8 Interrupt Modelling 

The behavioural model for the branch control unit detects the presence of any interrupts 
This is called on every simulated clock cycle via the clock update function. If an interrupt is 
detected then after the modeled delay the interrupt handler is called. If an interrupt is 
detected then the required handler is immediately called. Making copies of all the data 
bus storage variables preserves the pipeline state. The handler is called directly from the 
clock update routine so directly interrupts the execution of the region. It thus models the 
actual behaviour of the hardware. 

9.3 SystemC Unit Modelling 

SystemC may be used to model the implementation of an execution unit in a CriticalBlue 
processor. This allows a fully timed behavioural model to be developed. SystemC 
modeling is especially useful if a particular execution unit has complex interactions with 
the rest of the system. 

An example model is shown below: 



^ J"jS^T^ -^dule ( ^ execution unit is modeled « SystemC class 
- - ' Clock must be defined for the clocked thread 

int ini , in2 , out ; Holders for Inputs and outputs declared as data members - 

other SystemC ports may be defined for other Interactions 

void S y S t e»_c_modei ( > { Definition of the Junction that actually Implements 

-function called on each clock cycle- the behaviour of the hardware unit Reads data 
-reads ml and m2 as input operands- members and places results back into data 

-can place results in out,. members 



public : 

void hw_funcl(int opl, int op2J 
ini = opl; 
in2 = op2; 

} 

int hw_func2 () {return out;} 



SC_CTOR(HW_unit) { 

SC_CTHREAI> (system_C_model, elk. pos () J 



Input wrapper function for the execution unit Places 
parameters In data members to be picked up by 
system_C_rriodel. An input wrapper may only have 
Input parameters. 

Output wrapper for execution unit May only 
return values (or have reference parameters) 



Constructor for the class sets the Implementation 
as' a clocked thread 



The hardware -unit is derived from the SystemC class sc_module. The inputs and 
outputs from the execution unit are declared as member variables. These are used to 
communicate between wrappers called from the CriticalBlue simulation environment and 
the SystemC model. 

The implementation of the execution unit is declared as a function. In the class 
constructor this is declared as a clocked thread. It reads the member variables as inputs 
to the execution unit on each clock cycle and writes results to other member variables. 

The execution unit methods that may be used in the CriticalBlue environment are 
declared as public member functions. Methods must be separated into those that have 
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input parameters and those that return output parameters. Input and output parameters 

Sn n h h m ' Xe l ,n 3 s ! n9,e ^ eth0d - This is because a ^»tamC modelled unThJS 
timed behav.our. The input method writes parameters into the data members. These are 
then retneved by the docked implementation function. Output methods simply return the 
data member values placed by the implementation. 



10 Data Caches 

10.1 Cache Organisation 

A cache has a number of individual attributes as follows: 

□ The number of sets (S): This is the number of independently accessible sets in 
the cache. A direct mapped cache only has a single set. However, direct mapped 
caches tend to be inefficient due to their poor handling of address aliases. Caches 
with higher numbers of sets are more efficient but tend to be more power and 
area hungry due to the larger number of individual memory arrays that must be 
accessed in parallel. For the sake of easier addressing, the number of sets tends 
to be a power of two. Set size of 2, 4 or 8 are common. 

□ The number blocks (B): This is the number of lines in each individual cache set 
Along with the number of sets and words per block it determines the size of the 
cacne. 



□ The number of words per block (W): This is the length of any individual cache 
line. The cache line length represents the atomicity of loading or evicting an item 
from the cache. Line lengths are a power of two and lengths of 32 bytes or 64 
bytes are common. Shorter cache lengths reduce the number of unnecessary 
words that are loaded but are less efficient in terms of cache physical area as the 
tags take a higher proportion of the area. ' 

The diagram below shows the organization of a cache for a CriticalBlue processor: 
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Two independent addresses are supplied to the cache, a tag address and a main access 
address. Normally these addresses are identical. However, they may differ if a 
Speculative Write Buffer (SWB) is being used in conjunction with the cache. If a SWB is 
being used then stores are initially held in a buffer before being later written to the cache 
For a write the tag is looked up using the write address but the main memory is written 
using a different address to write back an item already held in the write buffer. 

The lower bits of the tag address are used to access the tag array in each set The 
accessed tag is then compared against the other bits of the tag address. If there is a hit 
(i.e. all the bits match the tag) then the output from the main array is enabled. Only one 
set may match on any one access as a particular address can only be present in one set. 

The rest of the address is used to access the main memory blocks directly Normally this 
address will be the same as that applied to the tags. If not then a buffered write is being 
performed. The entry is already guaranteed to be present in the cache. The write is 
directed to the set in which the address is present. 

There is a cache miss if none of the sets match the tag address. In that case the 
requested item must be brought into memory. Various schemes, as discussed below 
may be employed to bring this about. 

A Cache Access History (CAH) block may optionally be present. This is used to monitor 
which items are being accessed from the cache.. It may be used in the cache replacement 
scheme to determine which item to replace when a new cache item needs to be brought 
into memory. The replacement scheme seeks to replace the items which are infrequently 



The diagram shows the entire cache lookup being performed in a single cycle However 
a pipelined approach may be employed. The memory arrays may themselves be 
internally pipelined or the tag comparison may be performed in a second clock cycle after 
the tag access ttself. The cache access may be pipelined to allow the processor pipeline 
to be stalled if there is a cache miss. It is not determined if there is a miss until the tags 
have been looked up and comparisons made. This may be too late in the clock cycle to 
assert a wait to stall the processor pipeline if there is a miss. The cache lookup may be 
extended by an extra clock cycle to allow a wait signal to be distributed throughout the 
processor. 

10.1.1 Cache Permission Bits 

Cache permission bits may be optionally present for each cache line. They allow page 
access properties to be imposed using the cache mechanism. All memory accesses and 
cache indexes are performed using a single address space. If virtual memory is being 
supported then all accesses are thus performed using virtual addresses. The cache is 
virtually indexed. Since there is no TLB that is lookup up on each access, page properties 
have to be set in the cache. 

Each cache line optionally has the following permission bits: 

□ Read: Indicates that reads can be performed to the cache line 

□ Write: Indicates that writes can be performed to the cache line 

□ Execute: Indicates that executable code can be loaded from the cache line 
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These bits are updated by the cache management mechanisms (either hardware or 

SX'S: ille9al operatton 00 a ^ ,ine " ien 816 operason 6 

1 0.1 .2 Cache Dirty Bits 

2t t Si™ h ° an °P erate , wlth either a write-through or write-back cache architecture In a 
^nS 9 L? S ' WnteS , are Mediately passed to the main memory as S are 
performed. Ma.n memory and cache content remain constantly in step. Y 

Higher performance can be obtained using a write-back cache. In such an architecture 
T™1*T £ pied into main romory when a cache line needs to be 555 
from the cache. This avo.ds much unnecessary copying of writes to main memorv C Sn 

im 6 n.i U m Se r q, ; 6nt,y ° VerWritten Whi,e the ,ine fe ^present in fte^3SiT2£SjS 
implementation requires a dirty bit for each line. This is set if any byte with n th^e has 
been modified. If so then the whole line is written back to main mem^Ien ft !s evicted 

fob" evicSd " " ** ' ine ^ mt ^ m ° dHied *" * ^ be SLSSiif^SA 

Mra 3K hL Pr0Vid K 8d at IT ' eVel ° f 9 ranu,arit y in cache line if required. For 

E?K2 ? 6 apphed t0 individual words or ««" indiv idual bytes within the 
line. This allows finer gram write back of modified data within the lines. 

10.2 Cache Management 

The management of the cache is performed in hardware. Cache management is required 
when there is a cache miss. That is, the program has accessed a memorWoca ton^hSta 
not currently stored within the cache. The cache management musSgfoe reoufredtne 
into the cache, potentially evicting an existing cache line in the process. q 

^S^S^SSTf^ ° n Sta " ing ? e Pipe ' ine rf the processor wiutad 
cache line is obtained. This may require a higher latency cache as the pipeline stall sionai 

must be assert ed early in the cycle prior to that in which execution is to taTSifn^ 

be possible (without unreasonably stretching the overall cycle time) if a StootaTSS 

to be performed during the cycle as well. ' 9 P ^ ds 

10.2.1 Replacement Policy 

Whenever a new cache line is to be loaded into the cache from main memorv an existina 
£? h ^ re ? lacement Po*y determines which TrS is to b XfiSTSl 

set associative cache an item from any set in the required block can be used The 
replacement policy determines which set is chosen. 

The policy must tend to select lines that are not in frequent use. Evicting frequentlv used 
ines causes thrashing between the cache and main memory and Sn have a severe 
impact on overall system performance. e e 

To help withchoosing an eviction victim there is a Cache Access History (CAH) in each 
cache un. t This maintains a referenced bit for each set to support a Least Recently Used 
(LRU) replacement policy. This is based on Deville's algorithm. Whenever TZ is read 
the corresponding bit in the CAH is set. When all bits become aetaTX > bfe for the ite 

rS JLT 8 b,0 f are reset : 7,16 cycle 016,1 be s ins ^ ^a^SSSS 

needs to be chosen for eviction the reference bits are read. A set is chosen for which toe 
corresponding bit is reset (i.e. it has not been recently used). 
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The CAH update is performed a cycle later than the access so that the hitting set for an 
access ,s known. Th,s is determined by examining which tag comparison (il f any) a 

10.2.2 Miss Handling 

The miss handler must always return the correct value on a read and successful 
EES e1^r re a^rS ne Sta " Si9na ' " ^ fo h ° ,d *— * *» '™ 

2S? B rt* ,8q T t mad6 tD ° btain required «*» ,ine - a " eSnglne muTbe 

evicted from the cache. A hardware based LRU mechanism is employed as discussed 
Previously In a wnte back cache there may a requirement to write back aline S has 
been previously modrfied. This is dene before the new line is requested Once Se new 

Sdtom Z Z?*T fr ° m V£ mem0ry * is '"to the" cache. S? revested 

p^I^rJ ♦ then copied to the output register. The pipeline stall signal is de- 
asserted and execut.on continues with the correct value obtained from the cache 

If there is a cache miss on a write then the initial procedure is similar to a read miss The 

Etl ,n n T be °K? ined fr ° m main memo ^ evictin 9 ™ exSinglinTas required 
When the new hne is obtained the write is merged into the line data The pipeSe . stall 
signal is de-asserted and then execution may continue. P 

Note that while the pipeline is stalled no interrupts are handled. Thus the maximum 
latency for an mterrupt is extended by the maximum stall time. maximum 



1 1 Multiprocessor Environment 

11.1 Application Partitioning 

L?f CalBIU l- PrOCeSSOr iS d6Signed for use as a co-processor to a main control 
processor wrth.n a system. Thus whenever a CriticalBlue co-processor is use^ the 

r^H?^ 00 " 168 8 multi r cessorone - ^ P r °cess of partitioning code to mn on a 
Cnt calBlue cc-processor .nvolves profiling the original application code and then splitting 
off the functions to be accelerated onto the co-processor. P 9 

^JftSf'S"' t0 °' re , qUireS 016 cc ^ rocessor to be obtained from a separately 
linked and self contained application. Thus direct calls between the code runninc i on the 
mam processor and the code running on the co-processor is not possible. Sace code 
.s required for ail cross calling between the two. If the code relies on global variable? or 
S£*ffl£» ^ m " no " « mUSt be -tructur^ 

«* is profiled 
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Original 
Application 



w = fund (x) ; 
f unc2 ( y, z ) ; 

±nt fund (int y) { 
> 

void func2(int x, short y) { 
} 



Main Processor Code 



w = fund (x) ; 
func2(y, z) ; 



A 



Profiling 



Co-Processor Code 



int f unci (int y) { 
} 

void func2(int x, short y) { 
) 



Information for functions that may be 
called remotely from main processor 



Remote Function Definitions 



The partitioning must take account of the required amount of communication between the 
main processor and coprocessor. If too much communication is required then the 
SSS °LT aC H Ce,era !T * *■ on *e co-processor might be substentially 

2 nni^ * 8 ? Sf l OWn ' direct 03118 from the main Processor to the co-processor is 
not possible since the functions are not within scope. 

This is highly analogous to the typical process of integrating an IP block within a system 
L p °f n f ° A J 8 ?!PP ,,catlon code also h ^ to be modified to access the registers within 
SlLfS*- Attent ' 0n mUSt a '5° be Paid t0 «» amount of communication occurring 
5?!3l!^ ma ' n p ™ essor 5* * e b,0ck - men usin 9 the CriticalBlue tools, however 
the software automatically defines the functionality of the coprocessor block There is no 
requirement to implement the functionality manually in hardware. 

As illustrated all co-processor functions that may be called externally must be marked as 
remote function definitions. A remote function definition details the name of the function 

!£JS£L! ?£Z TVS T e ?„ Umber and types of each P arameter to *» function. As 
discussed later, ttiis facilitates the automatic generation of interface code to allow the 
functions to be called from the main processor. 

1 1 .2 System Architecture 

This section describes the types of system architectures that are commonly employed 
when using the CriticalBlue tools. y ■yuyeu 

11.2.1 Independent Co-Processors 

The simplest connection topology is shown in the diagram below: 
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Main Bus Interface 
• * 





Bus Slave 






Bus slave J 1 } 


Bus Slave i 


Co-Processor 


j Co-Processor j j Co-Processor j 



The main processor connects to each CriticalBlue co-processor via its main bus interface. 
As shown there may be multiple CriticalBlue processors, each optimised to perform a 
particular function in the overall application. 

In the simplest configuration each of the co-processors is connected as a slave to the 
bus. A functional unit within the co-processor implements the bus interface as required by 
the main processor (AMBA, CoreConnect, etc.). Each co-processor is allocated an area 
of the address map of the bus. This corresponds to individual registers held in the bus 
interface within the CriticalBlue processor. These registers can be accessed by software 
running on the processor. Transmission of data from the CriticalBlue processor to the 
main processor is performed via the main processor reading registers stored within the 
that interface. 

The co-processors may also have the capability to generate an interrupt to the main 
processor in order to handle a critical event or something outside of the normal 
communication protocol. Conversely, the main processor may also be given the capability 
to interrupt any individual co-processor. 



11.2.2 Dependent Co-Processors 

The CriticalBlue software tools are able to efficiently extract micro-level parallelism out of 
sequential code. This allows efficient utilisation of multiple functional units within an 
individual CriticalBlue co-processor. 

In order to extract macro-level parallelism from an application it is necessary to restructure 
the code to split functionality across multiple CriticalBlue processors. In this scenario it is 
often desirable to allow direct communication between the co-processors without having 
to use a common bus interface. This allows much more efficient and scalable 
communication in a configuration that closely matches the requirements of the actual 
application. 



This concept is illustrated in the diagram below: 
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Critical Application^ 
Functions 




Main 
Processor 



Co-Processor 
for fa 



Co-Processor 

for ft\ 
L_ 



Co-Processor 
for fc 



Co-Processor 
for fd 



A number of critical parts of an application are identified and decomposed into a series of 
operations that are shown at the top of the diagram. Each of these functions processes 
data and passes information onto another co-processor. This represents the predominate 
data flow associated with processing in the functions. For efficient implementation each 
function must process a block of data before passing it on for further processing to 
another co-processor. a 

In this style of configuration a particular co-processor may have multiple inputs from 
and/or multiple outputs to other processors in the system. All communication is nominated 
with a particular channel identifying the source of destination of the data. Some or all of 
the co-processors may also be connected to the main processor in the system. 

11.3 Message Passing Mechanism 

The processors in the system may communicate using a message passing system 
Blocks of data are passed from one processor to another. This allows co-operation 
between processors even when, in simple systems, there is no shared memory between 
them. The communication mechanism is very similar to the CSP (Communicatina 
Sequential Processes) model used in the OCCAM language and Transputers 
Processors communicate over channels. Synchronisation occurs between the processors 
at the point of communication. A blocking send stalls the sending processor until the 
receiving processor is ready to read it. 

The individual communication primitives that are supported are described below: 
11.3.1 Communication Primitives 

A number of primitives are made available to facilitate communication between two 
processors that are connected by a channel. These primitives are fully symmetric and are 
implemented in support libraries. A channel identifier class is associated with each class 
to allow communication occurring across different channels to be differentiated: 

□ Blocking Send: Sends a block of data to a particular output channel. Various 
prototypes of the function are supplied in order to pass different numbers of 
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parameters and with different types. There must be a corresponding receive 
executed by the receiving processor that uses the same prototype If the 
prototypes do not match the function can either throw an exception or return an 
error value. The send is blocking so that a return from the function is not made 
until the receiving processor has picked up the data. A prototype is also provided 
to send a block of memory. This is implemented as a DMA operation in order to 
obtain high performance when transferring large blocks of data. 

□ Non-Blocking Send: This operates like the blocking send except that the 
function returns immediately if the sent data can be held in a FIFO within the 
channel. Thus return may be made before the receiving processor has actually 
read the data This allows an improved level of decoupling between the 
operations of different processors. If the buffer is too full to accept the data then 
an error code is returned. 

□ Receive: Receives a block of date from a particular input channel. Various 
prototypes of tee function are supplied in order to pass different numbers of 
parameters and with different types. The prototype used must be identical to that 
used for the corresponding send. If it is not then the function can either throw an 
exception or return an error value. The receive blocks until data is available A 
prototype is also provided to load a block of memory. This is implemented as a 
DMA operation in order to obtain high performance when transferring large blocks 
of data. 

□ Poll: Checks" to determine if there are incoming messages on a particular 
channel. Allows a processor to poll without blocking if it is waiting to receive date 
from another processor. 

11.3.2 Message Queuing 

!!L°JS" to * upp ° rt non -°'°cking sends, FIFOs may be setup within tee communication 
channels. They decouple the interface between the processors, allowing a sendinq 
processor to continue operating after data has been sent to a channel. This allows more 
effective utilisation of the processing resources when the time required for particular 
blocks of computation exhibits large variability. 

An example FIFO structure is shown in the processor configuration below: 



Co-Processor 
A 



FFOs 



Co-Processor 
B 



FIFOs 



FIFOs 



Co-Processor 
C 



Co-Processor 
D 



FIFOS 



In this example FIFOs are shown in both communication directions. In many cases full 
sets of FIFOs will not be required. The size of the individual FIFOs is determined bv 
system modelling. y 
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11 A Remote Function Call 

As discussed previously, applications are partitioned between the main processor and 
CnticalBIue co-processor(s). a set of remote function definitions shows which functions on 
the co-processor may be called from the main processor. A remote call handler generator 
program is then used which reads that information and automatically generates interface 
code. A class definition in a .h file is generated that is compiled as part of the application 
on the main processor. A corresponding .c file is generated that is compiled as part of the 
co-processor application. 

This process is illustrated in the diagram below for the example used previously: 



Mafn Processor .h File 



class type_coprocl { 
int fund (int x) { 
send(D, x) ; 
return receive (); 

void func2 (int x, short y) { 
^ send(l, x, y) ; 

} coprocl; 



Remote Call Handler Generation 



Co-Processor .c File 



Auto Generated Files 
User Modified Files 
Remote Function Calls (Main) 



main ( ) { 




int x; 




short y; 




while [I) 


{ 


receive (type) ; 


switch (type) { 


case C 


] .-receive (&x) ; 




send (f unci (x) ) ; 




break; 


case 3 


L; receive (fix, &y) ; 




func2(x, y) ; 




break; 


> 




> 









x= coprocl .fund (yj ; ^ 
coprocl. f unc2 (y, z) ; ~ 



Function Implementations (Co-Processor) 



int fund (int x) 

void f unc2 (±nt x, short y) ; 



The main processor .h defines a class with wrapper functions for each of the functions 
that are being exported by the co-processor. These are generated using the remote 
function definition names and type information. Thus the number and types of the 
parameters to the function will correspond to those in the co-processor application The 
wrappers send a code value representing the function to be called along with the 
appropriate parameters. Finally any result from the function is received. This allows the 
semantics of the function call to be implemented on the main processor by using a remote 
function call mechanism. 

The code generated for the co-processor corresponds to the wrapper generated on the 
main processor.. Firstly, this reads the function number. A case statement or table can 
then be used to branch to the specific code for handling a particular function This 
receives the parameter information from the main processor. The appropriate function call 
is then made. Finally, if the function returns a result then the value is transmitted to the 
main processor so that it can be returned to the call point in the main program. 

Using this mechanism the function implementations running on the co-processor do not 
need to be modified. The function calls on the main processor only need to be modified in 
as much as they need to be made via the automatically generated remote function call 
class. The function names and the numbers and types of parameters remain identical 
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11.5 Shared Memory Architectures 

In a simple system all communication between processors can be performed using 
message passing. In that case each co-processor has its own private memory. Thus it is 
not possible to pass pointer values between the processors. Complex data structures 
have to remain within a single processor address space. 

In many cases it is desirable to provide a shared memory environment where the co- 
processors can access the same address space as the main processor. This allows 
pointers to be freely passed between the two environments and complex data structures 
to be shared. 



Providing a shared memory environment adds considerable hardware complexity, as 
caches are required within the co-processor that must remain coherent with contents of 
other caches in the system and main memory. 

11.5.1 Shared Physical Memory 

A shared physical memory system represents the simplest implementation. In this case 
the CriticalBlue co-processors only use physical addresses, avoiding the need for any 
memory management or a TLB. In many embedded systems there is no requirement to 
use virtual memory or virtual memory is mapped to correspond directly to the physical 
addresses. In such an environment this mechanism is ideal. 

An example shared physical memory layout is shown in the diagram below: 
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It is expected that the main processor contains a cache. For good performance such 
caches normally include use a write-back rather than a write-through caching mechanism. 
Thus data that has been updated is only written back to main memory when the cache 
line needs to be evicted. 

Each of the co-processors must be capable of acting as a bus master in order to read and 
write data from the main memory. Each of the co-processors must also be capable of 
supporting the cache coherency protocol support on the main processor bus. This 
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protocol is normally related to the MESI system that ensures that a particular cache line 
can only be wntten to by one particular processor in the system at any one time. 

The <x>processors also contain write-back caches that are physically indexed. If there is a 
cache miss then the pipeline of the co-processor is stalled. The bus master/slave unit is 
then used to access the main memory to obtain the required data. 

11.5.2 Shared Virtual Memory 

A more complex arrangement allows sharing of memory in a virtual memory environment 
The configuration is largely similar to sharing of physical memory except that each 
individual co-processor also needs to contain a TLB (Translation Lookaside Buffer) in 
order to rapidly translate virtual page addresses into physical page addresses. 

An example configuration is shown in the diagram below: 
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A virtual address is supplied to the cache within the co-processor. The lower bits of the 
address (which index within a single page) are then used to access the memory array In 
parallel a translation is performed in the TLB from the virtual page address to the physical 
page address. This is then used to match against physical tags associated with the 
cache. If there is a cache miss then the bus master/slave unit is used in the same manner 
to obtain data from the main memory using the physical address. 

A shared virtual memory system also requires management of the entries within the TLB 
n this configuration it is assumed that the main processor is running an operating system 
that determines when to page in and out particular blocks of virtual memory in the 
physical address space. Whenever there is a miss in the TLB of a co-processor an 
interrupt to the mam processor is generated. This causes the required virtual page to be 
ooked up in the page tables (bringing the data in from secondary storage if required) and 
the physical page address transmitted to the co-processor where it is stored for ftjture 
usage in the TLB. Moreover, if a physical page is ever reclaimed by the operating system 
for use by another virtual page then the corresponding entries in all co-processor TLBs 
must be evicted. This is done using a broadcast message sent from the main processor 
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Thus this mechanism requires changes to be made to the memory management handlina 
routines within the kernel of the operating system. 

In general TLB misses and potentially cache misses should only be generated by 
operations executing in the lowest numbered active strand. This prevents false requests 
being generated by later strands executing speculative operations that contain invalid 
addresses. If the addresses are valid then the execution will eventually reach the first 
executing strand. 



12 Write Speculation 

12.1 Principle 

Write speculation allows memory store operations to be performed during the speculative 
phase of a strand. This is not normally possible as only operations that do not 
permanently change the machine state can be issued. By allowing writes to be issued in 
the speculative phase, the scheduling freedom of the code generation is greatly 
increased. This allows more effective and balanced use of resources and greater 
parallelism. The number of check hazard operations that need to be performed is 
reduced, as stores can be moved earlier in the schedule and reduce the need to move 
loads earlier than those stores. 

s P^ u,ation uses a dat a buffer associated with memory units in the processor This 
buffer holds data that has been written to memory until it is established whether it will be 
committed or not. If the strand from which the data was written is aborted then the 
associated items in the buffer can simply be discarded. This there is no permanent effect 
on the machine state. 

The actual writes to main memory occur while speculative writes, from subsequently 
executed regions, are being transferred into the buffer. Thus on average no additional 
memory bandwidth is required when using this mechanism. Stores are simply delayed on 
their transport to the actual memory system. 

12.2 Register Write Speculation 

Shadow register files. Registers are committed upon commit for strand (registers tagged 
with strand number). Shadow copies restored at start of region. Thus speculative register 
writes can be performed during speculative strand phase. 

12.3 Memory Unit Incorporating SWB 

The diagram below shows a memory unit that incorporates a SWB. The memory array 
itself shown in the centre of the diagram is identical to that found in a standard memory 
unit. It may be a cache memory if required. 
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A multiplexer in the read data path selects between data from the memory array and from 
the buffer. This allows read bypassing if the most recent copy of a memory location is 
being held in the SWB. All writes are initially directed to the SWB. During those cycles the 
SWB control logic is able to direct previously written entries in the SWB to the main 
memory array. 

12-4 SWB Structure 

The Speculative Write Buffer (SWB) consists of two separate units: 

□ Write Address CAM (WAC): This is the actual speculative buffer that holds the 
data to be written, along with its address and other attributes 



□ 



Memory Write Controller (MWC): This is a controilenfor organizing the delayed 
write back of data from the WAC into the main memory. The MWC also detects 
read bypass situations and ensures that the correct data from the WAC is made 
available as the return data. 



12.5 Write Address CAM (WAC) 

The structure of the WAC is shown below. It consists of a number of individual entries 
used for holding speculative data. 
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There are three methods by which a particular entry may be addressed. Firstly, during a 
read the WAC may be accessed an associative lookup via the address field of each entry 
This allows speculative writes to the same address to be located. Secondly, the entries 
may be addressed using a write pointer. This addressing is used when writing new 
entries. Finally, the entries may be addressed using a drain pointer. This is used when 
reading the data from entries prior to committing them to memory. 

Each entry contains a strand comparator. This is used to determine if a strand being read 
is the same strand or later than an entry in the WAC. If there are multiple such entries in 
the WAC then there is a multi-hit detection. This generates a hazard causing the read 
strand to be aborted. 

The purpose of the individual fields in each entry is as follows: 
12.5.1 Address Field 

This field holds the address of the written data item. If virtual memory is being used then 
this holds a virtual address. 



12.5.2 Subwords Field 

This field holds enables for each subword that was written. Subword writes are only 
supported for the main memory unit. This memory unit has a width of 32 bits and is byte 
writable. Thus there are 4 bits representing which bytes within the word were written. 

12.5.3 State Field 

Each entry may have one of three individual states. This field holds the current state. The 
possible states are as follows: 

□ Empty: Indicates that the entry is unused. All entries are initially unused. Entries 
enter the unused state if they previously held entries that were aborted or pending 
entries that have been written to memory. 

□ Probationary: Indicates that the entry holds a speculative write entry. When an 
entry is first written to the WAC it is in this state. 
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□ Pending: Indicates that the entry holds a write that has been committed. The 
entry is pending to be drained to main memory. 

The possible state transitions are shown below: 



12.5.4 Data Field 

Holds the data that was written and will be written to main memory if the entry is 
committed. 

12.5.5 Cache Set Field 

This field is only required if the main memory is a cache. It shows which particular set 
holds the line containing the data to be written. The cache set information is obtained 
when the cache tags are looked up when the write is initially performed. The number of 
the set is preserved in this field. If the item is not present in the cache then the miss 
handling hardware is initiated. 

If the entry is eventually committed to the memory then this field is used to direct the write 
to the correct set without having to read the tags a second time. Cache lines are not 
evicted by the cache miss handling logic until all pending entries have been written back. 
This ensures that the line will be present in the same position in the cache between the 
speculative write and the final write back to memory. 

12.5.6 Strand Field 

The strand field holds the number of the strand that performed the write. This is used to 
determine whether the strand should be committed or aborted. If the write belongs to a 
committed strand then a probationary entry enters the pending state. If the write belongs 
to an aborted strand then the probationary entry is discarded. The strand field is also used 
by reads that are aliased with previous writes. If the strand number of the read is the 
same or higher than that of the write then a read bypass is performed so that the read 




Region succession if 
entry from a enabled 
L strand in SEM 
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obtains the speculatively written data. If the strand number is lower then the read is 
logically earlier than the write so the current value in memory is used. 

This field is only valid for probationary entries. 



12.6 SWB New Entry Selection 

New entries in the WAC are allocated in a round robin fashion. Two pointers manage the 
process. A write pointer shows the position that the next entry will be written to. A drain 
pointer shows the position that the next entry to be written back to memory will be read 
from. 

If the write pointer reaches the same position as the drain pointer then a pipeline stall is 
signaled. Pending entries can then be drained from the WAC until there is sufficient free 
space to continue again. The total number of writes in each region never exceeds the 
total number of entries in the WAC. Thus there is always sufficient space to complete the 
region, assuming pending entries from previous regions are drained as required. 

The WAC allocation policy is illustrated in the diagram below: 
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12.7 SWB Draining 

Pending writes are held in the WAC until they can be committed to memory. A pending 
write may be committed on each non-read access to the memory. Writes only require 
access to the cache tags (if the memory unit is a cache). Thus on each unused or 
speculative write cycle another entry is drained to the memory. 

In the steady state data can thus be drained at least as quickly as it is written. There may 
be variations in the numbers of writes between successive regions but the buffering of the 
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WAC will normally hide these. In the worst case the pipeline may be stalled to allow more 
entries to be drained. In the case of a tight loop the number of stores will be the same on 
each iteration so the steady state is achieved. 

When the drain entry reaches the write pointer the WAC is empty. Draining continues until 
this condition is met or a probationary entry is reached. 

12.8 Use of Pending Signal 

Each memory unit containing a SWB has a pending output. This is asserted while the 
SWB contains unwritten pending items. This signal is used by any cache miss handler 
hardware for the memory unit. The miss handler cannot evict any line from the cache until 
the pending output is de-asserted. This is because one of the pending items may be a 
member of that cache line. The SWB draining does not read the cache tags to check that 
the entry is correct. It relies upon the line remaining in the cache between the speculative 
write and its final commitment to memory. Thus all cache misses force a synchronization 
between the SWB and the memory. 

12.9 Read Bypassing 

Read bypassing allows reads to obtain a value from the SWB before it has been 
committed to main memory. This preserves the correct semantics of the interaction 
between reads and writes to memory. A read issued after a write should obtain the written 
value if the addresses are identical and the strand number of the read is the same or 
greater than the write. 

Whenever a read is performed an associative lookup is performed on the SWB, matching 
against the address. If the match is against a pending entry then no strand match is 
required as the entry must have been written in an earlier region. The data from the entry 
is forwarded as the result of the read. If the match is against a probationary entry then a 
strand comparison is also performed. The data is only forwarded if the read is from the 
same or higher numbered strand. 

In some cases a read may hit a subword write entry. In that case the new data in the entry 
is merged with existing data in the memory. This produces the correct merged form of the 
data word. 



12.10 Multi-Hit Hazards 

Writes that have been performed to the SWB are drained to the memory unit in their 
original order. Thus multiple writes to the same location that may be simultaneously in the 
SWB are handled correctly. The later write overwrites the any earlier ones. 

However, it is possible that a read might be performed while there is more than one write 
to the read location held in the SWB. To determine which value should be returned to the 
read would require complex prioritization logic. To avoid the need for such logic this 
situation is not handled directly. Instead, the strand causing the read is aborted. This is 
done by asserting a hazard* in the same manner that a check hazard operation would. 
The reading strand is aborted and the region is then re-executing. Once the region is 
restarted the previous writes are either committed or aborted. If they are committed then 
they are eventually drained to main memory and the hazard causing condition will 
eventually disappear. 

This implementation imposes some constraints on the code generation. Firstly, any read 
that might access a word that has been written more than once in the region can 
potentially generate a hazard. Thus such a read must appear in the speculative phase of 
its strand. The alias analysis in the code generation can determine which reads are 
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affected. Secondly, a read cannot be scheduled in the same strand as potentially multiple 
writes to the same location. If the read strand is aborted then so will the writes and the 
condition will keep reappearing on every ne-execution of the region. Again, alias analysis 
in the code generation ensures that this scenario cannot occur. 



12.11 Speculative Write Examples 

This section provides examples of how the speculative write mechanism operates. 
12.11.1 RAW Resolution Example 

This example demonstrates how the speculative write mechanism resolves RAW 
hazards. A RAW dependency indicates that a read operation should obtain the value 
written by an earlier write operation. However, in many cases it is not possible to 
determine before execution time whether any particular read will access a location that 
has been previously written to in the region. This means that whenever a read is 
performed a check must be made to determine if it is to a location that has been 
previously written to and is being held as probationary data in the SWB. 

The RAW resolution is illustrated in the following diagram: 



[SI; write y^3 - 
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Initially x = 1, y = 2 

Write creates a new SWB entry for 
the address y - data Is not written 
to memory immediately ' 
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Write creates a second SWB entry 
for the address x - data is not 
written to memory immediately 



S4: read y 
Result=3 - 
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A read from y Is performed. Because the 
read strand is the same or higher than 
the strand in which the data was written 
the data from the SWB is obtained. 



SO: read x 
Result=l 
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A read from x is performed. Because the 
read strand Is lower (i.e. earlier) than the 
writing strand the data Is obtained from 
main memory. 



When the write to address y occurs a probationary entry is created in the SWB. A 
following write to address x also creates a probationary entry. When a read of y is 
performed the data is returned from the SWB rather than main memory. This is because 
the read has occurred after the write and is from a higher (i.e. temporally later) strand. 
Even though the write value has yet to reach the memory (and may never reach memory 
if the writing strand is squashed) it represents what is logically held in memory for the read 
operation. When strand 0 reads address x, however, the value returned is not from the 
SWB but from the actual memory contents. This is because strand 0 is temporally earlier 
than the writing strand. Thus although the write had already physically occurred it is 
logically later than the read so cannot pass data to it. 

12.11.2 Multi-Hit Hazard Example 

This example shows how the SWB handles a multi-hit hazard. Such a hazard occurs if a 
read detects that the required location has been written to more than once by previous 
writes within the region. The ordering of the writes determines the order in which they are 
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ultimately written into main memory. However, the value that should be obtained for a 
read is dependent on the strand numbers of the read and the writes. Since this is complex 
to prioritise the system generates a multi-hit hazard. 

The diagram below shows a multi-hit hazard example: 
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Both strands 1 and 2 perform writes of address y. When strand 2 subsequently reads 
from y it detects two different probationary writes to the address y. This causes a multi-hit . 
hazard that aborts strand 2 and all higher strands. When the end of the region is reached 
only those writes from strands that have not been aborted are changed to pending status. 
All other probationary writes are cleared. 

Since an abort has been generated the same region is re-executed. At the start of the 
region only the write to address y from strand 1 is made pending. This pending value is 
immediately used to update the memory contents. During the re-execution of the region 
operations from strand 1 are not executed as the strand has been previously completed. 
When the read of address y in strand 2 occurs again only a single probationary write for 
address y is present in the SWB. Thus the value can be forwarded to form the read result. 
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